AdaBoost is one of the most important recent developments in classification
methodology. AdaBoost works by repeatedly applying the base learning
algorithm to re-sampled versions of the training data to produce a collection of
hypothesis functions, which are finally combined via a weighted linear vote to form
the final decision. Under mild assumptions, AdaBoost can lead to a classification
algorithm with arbitrary accuracy. By pursuing a large norm-1 margin, AdaBoost can
also significantly improve the generalization performance in many cases. However,
recent studies showed that AdaBoost performs poorly on noisy data. In this work
we present several new regularized boosting algorithms to mitigate the overfitting
problem of AdaBoost. Our regularized algorithms are directly motivated by the
connection between AdaBoost and linear programming. They implement an intuitive
idea: controlling the distribution skewness in the learning process, by introducing
a smooth convex penalty function into the objective function of the minimax problem,
to prevent outlier samples from spoiling decision boundaries. Large-scale experiments
based on UCI (University of California, Irvine), DELVE (Data for Evaluating Learning
in Valid Experiments), STATLOG and USPS (US Postal Service) datasets are
conducted. For the UCI, DELVE and STATLOG datasets, we show that our regularized
boosting algorithms perform at least as well as, and in some cases much better
than, other regularized AdaBoost algorithms. For the USPS datasets, we show that
our algorithms are very robust against class mislabeling and feature noise. We also
extend our analyses to multiclass problems. In particular, two multiclass AdaBoost
algorithms, AdaBoost.MO and AdaBoost.ECC, are investigated. We prove that both
algorithms can be categorized into the family of stagewise functional gradient descent
algorithms. Based on the different margin definitions, two new regularized multiclass
AdaBoost algorithms are also proposed.

We  also  consider  landmine  detection  via  forward-looking  ground  penetrating 
radar  (FLGPR)  by  using  time-frequency  analysis  and  AdaBoost.  Our  task  is  to 
detect  the  presence  of  landmines  in  radar  images.  We  formulate  it  as  an  object 
recognition problem. The two main challenges are: (1) how to extract intricate structures
of target signals from radar imagery and (2) how to adapt a classifier to surrounding
environments  through  learning.  To  address  the  first  challenge,  we  introduce  the 
time-frequency  analysis  into  the  landmine  detection  problem.  Several  time-frequency 
distributions  are  considered  to  characterize  and  interpret  scattering  phenomena  for 
both  targets  and  clutter.  Through  the  time-frequency  analysis,  we  find  that  the  most 
discriminant  information  is  time-frequency  localized.  Based  on  this  observation,  we 
propose a wavelet packet transform based detector. The basic idea is to use
the overcomplete wavelet packet transform to represent signals sparsely, with the
discriminant information encoded into several bases; a feature selection
method is then used to extract these components, and a classifier is designed from them.
Various feature selection strategies as well as classification schemes are investigated.
However, it turns out that the conventional classifier design procedure cannot fully
address the problem of detecting landmines in unconstrained environments. We use
the AdaBoost algorithm to further improve the classification performance. Special
effort is put into controlling two conflicting factors: overfitting and the expression
power of a classifier. Experimental results based on measured FLGPR data are
presented to demonstrate the effectiveness of our proposed classifiers.
CHAPTER 1
INTRODUCTION 


I spent a large part of my Ph.D. study on devising automatic target recognition
(ATR) algorithms to detect landmines using radar images. If successful, our
algorithms could save hundreds and thousands of lives around the world. Landmine
detection is a quite challenging problem that includes many of the difficulties one can
encounter when designing an ATR algorithm, e.g., very limited mine training data,
undefined clutter samples (since anything other than a mine is clutter) and the need to
understand underlying physical phenomena to extract relevant features. In fact, the
main motivation for me to study the adaptive boosting (AdaBoost) algorithm was to
improve landmine detection performance. After I went deeper into this area, I was
fascinated by the idea of AdaBoost as well as ensemble learning. The main drawback
of AdaBoost is its sensitivity to noise, including feature noise and class mislabelling,
which could limit its applicability to problems such as landmine detection
where, due to low signal-to-noise ratios, corrupted features and class mislabelling are
common. In this work, I mainly focus on improving the robustness of AdaBoost
against noise and applying the AdaBoost algorithm to landmine detection. Therefore,
this dissertation consists of two parts: AdaBoost and regularized AdaBoost,
and their applications to landmine detection.

1.1  Introduction 

1.1.1  AdaBoost  and  Regularized  AdaBoost 

AdaBoost [69], [68], [27] is considered one of the most important recent
developments in classification methodology and has been used with great success
in many applications [71], [73], [53]. The main idea of AdaBoost is to repeatedly apply
the base learning algorithm to re-sampled versions of the training data to produce
a collection of hypothesis functions, which are finally combined via a weighted linear
vote to form the final decision. An intuitive idea in AdaBoost is that the examples
which are misclassified get more weight in the following iterations, and hence the
subsequent classifiers focus more and more on those harder cases, for instance, the
samples near decision boundaries. Under the mild assumption that the base learners
can achieve an error rate slightly better than random guessing, AdaBoost can reduce
the training error exponentially with respect to the number of combined base
learners. Moreover, AdaBoost can also improve the generalization error and, in many
cases, the generalization error continues to decrease even after the training error
becomes zero.
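
To make the reweighting idea concrete, the following is a minimal sketch of the discrete AdaBoost loop, not the implementation used in this work. It assumes decision stumps as the base learner and reweights the samples directly instead of re-sampling them; the function names (`fit_stump`, `stump_predict`, `adaboost`, `predict`) are illustrative only.

```python
import numpy as np

def stump_predict(X, j, thr, s):
    """Decision stump: s * (-1 if x_j <= thr else +1)."""
    return s * np.where(X[:, j] <= thr, -1, 1)

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error under weights w."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                err = np.sum(w[stump_predict(X, j, thr, s) != y])
                if err < best[3]:
                    best = (j, thr, s, err)
    return best

def adaboost(X, y, T=20):
    """Assumed setup: labels in {-1, +1}, stumps as base learners, reweighting."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # start from the uniform distribution
    ensemble = []
    for _ in range(T):
        j, thr, s, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        if err >= 0.5:                      # base learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(X, j, thr, s)
        w = w * np.exp(-alpha * y * pred)   # misclassified samples gain weight
        w /= w.sum()
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    """Weighted linear vote of the base learners."""
    F = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.sign(F)
```

On a one-dimensional toy set whose positive class lies on both sides of the negative class, no single stump is error-free, but a weighted vote of a few stumps can be.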

AdaBoost  pursues  a norm-1  margin  in  contrast  to  a norm-2  margin  in  support 
vector  machine  [15].  A large  margin  is  usually  conducive  to  good  generalization  in  the 
sense  that  if  a large  margin  can  be  achieved  with  respect  to  the  data,  an  upper  bound 
on the generalization error is small. In the early development of AdaBoost, due to
its impressive generalization capability in many applications, some researchers
conjectured that AdaBoost was immune to overfitting. Several large-scale experiments
[21], [4], [60] based on artificial and real-world datasets have been conducted to
compare AdaBoost with other ensemble algorithms such as Bagging and to investigate
the behavior of AdaBoost. It was shown that AdaBoost in many cases
can significantly improve classification performance. However, it was also found
that AdaBoost is very sensitive to noise. The main reason is that AdaBoost puts
too many of its resources (weights) onto the few hardest examples and tries to
classify them indiscriminately according to their associated labels, even though they
might be mislabelled. A natural strategy to alleviate this problem is to use the
concept of a soft margin. One typical example is AdaBoost$_{Reg}$ [63], which is one of the
first boosting-like algorithms that achieved state-of-the-art generalization results
on noisy data. It implements an intuitive idea of controlling the tradeoff between
the margin and the sample influences to achieve a soft margin. In the experimental
evaluations [63], it was found to be among the best performing ones. However, the
problem with AdaBoost$_{Reg}$ is that it is difficult to analyze the underlying optimization
scheme since the modification is done on the algorithm level [49], [62]. Another
example is LP$_{reg}$-AdaBoost, which is the underlying optimization scheme supporting
the $\nu$-Arc [64] and C-Barrier [62] algorithms. LP$_{reg}$-AdaBoost exploits the connection
between AdaBoost and a minimax problem and introduces slack variables into an
optimization problem in the primal domain, in the same way as the support vector
machine does for the non-separable data case, to achieve a soft margin [15]. In
the dual domain, it is equivalent to constraining the distributions to a box, which
can be understood as adding a penalty of 0 within the box and $\infty$ outside the box.
Therefore, this scheme is somewhat heuristic and may be too restrictive [48]. We
instead consider controlling the distribution skewness by adding a convex penalty
function to the objective function in a minimax problem formulation. To solve the
resulting piecewise convex optimization problem, the column generation technique is
used to generate the gain matrix and transform the convex problem into a set of linear
programming (LP) problems. We propose three regularized AdaBoost algorithms that
differ in their implementation, referred to as AdaBoost$_{KL}$, AdaBoost$_{Norm2}$
[76] and LP$_{norm2}$-AdaBoost [75]. In AdaBoost$_{KL}$ and AdaBoost$_{Norm2}$, AdaBoost
is used as a general-purpose LP solver to obtain an approximate solution, while in
LP$_{norm2}$-AdaBoost, a stabilized column generation technique is used to achieve the
exact solution. To demonstrate the effectiveness of the proposed algorithms, a large-scale
experiment based on the UCI, DELVE and STATLOG datasets is conducted,
showing that our algorithms perform at least as well as, and in some cases much
better than, other regularized boosting algorithms. We also investigate
the robustness of the proposed algorithms against class mislabeling and
feature noise. The benchmark USPS handwritten digit dataset is used. We artificially
change the class labels or add some feature noise to the image samples. The
experimental results show that the performance of AdaBoost can degrade dramatically
with the increase of the noise level, while our regularized algorithms demonstrate
strong noise resistance.

We also extend our derivations to multiclass problems. In contrast to binary
classifiers, multiclass classifiers assign class labels to patterns where labels are drawn
from a finite set of classes. One commonly used approach is to first transform a
multiclass problem into a set of binary class problems based on a code matrix, and
then combine the outputs of the binary classifiers in some way to form final decisions
[34], [22], [3]. Therefore, in addition to the classifier training, it also involves the
design of a code matrix. According to Crammer and Singer [18], the approaches to multiclass
problems can be summarized into three categories: (1) given a set of binary
classifiers,  find  a code  matrix,  (2)  given  a code  matrix,  find  a set  of  binary  classifiers, 
and  (3)  find  a set  of  binary  classifiers  and  a code  matrix  simultaneously.  We  are 
not  interested  in  the  first  category.  Most  of  the  existing  algorithms  belong  to  the 
second  one.  The  main  drawback  of  these  approaches  is  that  a predefined  code  matrix 
fails  to  address  the  dependence  between  a code  matrix  and  the  class  of  hypothesis 
functions  used  to  construct  the  binary  classifiers  as  well  as  the  specific  application. 
In  this  sense,  finding  the  binary  classifiers  and  a code  matrix  simultaneously  appears 
to  be  the  best  choice.  Unfortunately,  this  problem  has  been  shown  to  be  NP-hard. 
A question arises naturally: can we find a way to overcome the drawback
of a predefined code matrix while avoiding the need to solve an NP-hard problem?
We attack multiclass problems from the viewpoint of AdaBoost. Specifically, in this
work, we consider AdaBoost.MO [69] and AdaBoost.ECC [33]. We first prove that
both algorithms can be viewed as performing a functional gradient descent procedure
on certain cost functions based on different margin definitions. AdaBoost.MO uses
a predefined code matrix while AdaBoost.ECC alternately generates columns of a
code matrix and hypothesis functions. In this sense, AdaBoost.ECC can be considered
as an alternative approach to exploiting the underlying dependence between a
code matrix and the hypothesis functions under consideration. Therefore, some in-depth
theoretical studies of AdaBoost.ECC are warranted. As with AdaBoost, there are no
regularization terms that prevent multiclass AdaBoost algorithms from overfitting,
and very little work has been done to improve the robustness of multiclass AdaBoost
against noise. We extend our derivations to multiclass problems and propose two
new regularized algorithms based on AdaBoost.MO and AdaBoost.ECC.
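
The role of a predefined code matrix can be illustrated with a small sketch. It assumes the simplest possible choices, a one-vs-all code matrix and Hamming-distance decoding, which are common in the literature but are not claimed to be the constructions analyzed in this dissertation; the helper names are hypothetical.

```python
import numpy as np

def one_vs_all_code_matrix(K):
    """K x K code matrix: row k is the code word of class k (+1 on the diagonal, -1 elsewhere)."""
    return 2 * np.eye(K, dtype=int) - 1

def decode(M, binary_outputs):
    """Assign each sample to the class whose code word is closest in Hamming distance.

    M: (K, L) code matrix in {-1, +1}; binary_outputs: (n_samples, L) outputs
    of the L binary classifiers, also in {-1, +1}. Ties go to the lowest class index.
    """
    hamming = (binary_outputs[:, None, :] != M[None, :, :]).sum(axis=2)
    return hamming.argmin(axis=1)
```

Each column of the matrix defines one binary problem, so fixing the matrix in advance also fixes which binary problems the base learners must solve, regardless of how well the hypothesis class suits them.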

1.1.2  Landmine  Detection 

Landmines are causing enormous humanitarian and economic problems in
many countries all over the world. Experts estimate that up to 110 million landmines
need to be cleared and more than 20,000 civilians are killed or maimed every year by
landmines, with many of the victims being children [2]. However, landmine detection
and clearance have turned out to be extremely challenging problems. At the current
clearance rate, it will take about 1,000 years to remove all landmines that are already
placed, and for every landmine cleared, a further 20 are being buried. Therefore it
is urgent to develop a safe and cost-efficient landmine detection system. In the past
fifteen years, various techniques, including acoustic sensors [35], infrared techniques,
quadrupole resonance [43] and down- and forward-looking ground penetrating radar
(FLGPR) [72], [74], have been investigated. Among them, ground penetrating radar
(GPR), which detects subsurface targets based on the change of dielectric permittivity
rather than the metal content of the targets [11], is considered a viable technology
for landmine detection. There are basically two types of GPR systems currently
under investigation. One is the down-looking GPR system. This type of system
has its antennas placed near the surface of the earth. Although removing the strong
signals reflected directly from the ground surface, referred to as ground bounce
removal, is a challenging problem [80], [32], this type of system shows a very promising
detection capability. Its main drawback is that it is time-consuming to use this type
of system for large area interrogation, and the short standoff distance is a problem as
well. Therefore, a forward-looking system is desirable in many applications.

Detecting the presence of landmines in the radar images can in general be
formulated as an object recognition problem. Conventional signal detection techniques
such as the matched filter method may not be applicable here since it is very
difficult, if not impossible, to estimate the target signatures as well as the clutter
statistical properties due to the extremely complicated operating environments. One
possible method to overcome this problem is to design a detector through learning,
which is often used in the problem of detecting faces or pedestrians in highly cluttered
scenes [54], [55], [50]. However, compared to the conventional object recognition
problem, landmine detection has its own specific challenges. First, in contrast to
face  recognition  and  character  recognition  where  intensity  images  provide  us  with  an 
abundant  amount  of  human  recognizable  information,  radar  images  can  only  be  fully 
understood  through  the  analysis  of  radar  scattering  phenomena.  Based  on  the  object 
size,  physical  structures  and  composition  materials,  different  objects  react  to  incident 
radar  signals  differently  [13], [12], [42].  Hence,  how  to  quantify  these  differences  and 
thereby  design  a classifier  becomes  the  key  to  the  success  of  landmine  detection.  In 
the  context  of  pattern  classification,  this  process  is  referred  to  as  feature  extraction. 
In our case of a limited number of training samples (in particular mine samples), being
able to extract features effectively becomes even more critical since by designing
a classifier, we in essence construct a mapping function to fit the training data, and
too much redundant information can easily lead the resulting classifier to overfit.
Though numerical simulations may help us identify this useful information, given the
extremely complicated surveying scenarios, the usefulness of simulated information
is very limited and we must resort to a training-based method which is capable of automatically
extracting the intricate structures of the target signals through learning.
The  second  challenge  in  landmine  detection  is  that  in  contrast  to  the  case  of  pattern 
classification  where  we  only  need  to  decide  between  well-defined  classes,  a landmine 
detector  is  required  to  differentiate  mines  from  the  rest  of  the  world.  While  the  mine 
samples  are  well-defined,  there  are  no  typical  examples  for  clutter.  In  other  words, 
it  requires  us  to  design  a classifier  which  has  sufficient  expression  power  to  claim  a 
region  in  the  high  dimensional  feature  space  for  mines.  Note  that  this  requirement 
is  opposite  to  the  previous  process  of  feature  extraction.  On  the  one  hand,  we  want 
to  reduce  the  data  dimensionality  to  improve  the  classifier  generalization  capability 
over unseen samples; on the other hand, the classifier should have sufficient expression
power to attain a low training error. In fact, how to trade off the influences of
these two concerns is an essential part of learning theory. As stated
in [78], learning is a process of choosing an appropriate function from a given set
of functions based on the training data. This concept will be adopted throughout
our entire landmine detector design process in this dissertation.

A conventional  classifier  design  procedure  usually  consists  of  two  modules: 
feature  extraction  and  classifier  training.  Our  work  covers  every  aspect  of  designing 
a classifier.  To  start  the  work,  therefore,  the  first  important  thing  we  need  to  do  is  to 
investigate  what  kinds  of  features  we  can  extract.  To  this  end,  we  introduce  the  time- 
frequency  analysis  into  the  landmine  detection.  Several  time-frequency  distributions 
are  considered  to  characterize  and  interpret  the  scattering  phenomena  for  both  targets 
and  clutter.  In  particular,  the  Choi-Williams  distribution  is  found  to  be  suitable  for 
our  specific  application  of  landmine  detection.  Through  the  time-frequency  analysis, 
we obtain a graphical understanding of how different objects react differently to the
incident radar signals. We also find that the most discriminant information between
the two classes is time-frequency localized. Based on this observation, a wavelet
packet transform (WPT) based detector is proposed. The wavelet packet transform, with
its flexible decomposition capability, is used to encode this time-frequency localized
information efficiently into several bases, and then a feature selection method is used
to find these components by optimizing a certain cost function. With the selected
features, a classifier can be designed. Various feature selection strategies as well as
classification schemes are investigated. With the use of our proposed WPT based
classifier, we conducted two blind tests in August and October, 2003, respectively.
Our results scored the best among the research institutions participating in the blind
tests. To further improve the detection performance, we investigate the use of the
boosting method for landmine detection. In particular, the AdaBoost algorithm is
considered. With a neural network classifier as the weak learner, AdaBoost can
significantly improve the detection performance. We also propose an adaptive feature
selection  scheme  by  integrating  the  feature  selection  module  into  each  iteration  of 
AdaBoost.  It  is  shown  experimentally  that  this  scheme  can  effectively  extract  the 
discriminant  information  and  at  the  same  time  control  the  side  effect  of  overfitting. 

1.2  Organization  of  the  Dissertation 

This dissertation is organized as follows. In Chapter 2, we give a detailed
review of the AdaBoost algorithm. In Chapter 3, we propose two regularized AdaBoost
algorithms, referred to as AdaBoost$_{KL}$ and AdaBoost$_{Norm2}$. In Chapter
4, a linear programming based robust boosting algorithm is proposed. In Chapter
5, we extend our derivations to multiclass problems. In Chapter 6, we consider detecting
landmines using wavelet packet transform and AdaBoost. Finally, Chapter 7
concludes this dissertation with our summary and future work.


CHAPTER  2 
ADABOOST 

2.1  Preliminary  and  Notations 


In this dissertation we consider pattern classification problems. The task of
pattern classification is to find a rule, learned from some training samples, that
assigns an unseen testing sample to one of several possible class labels. The problem
can be mathematically formulated as estimating a mapping function $f : \mathcal{X} \to \mathcal{Y}$,
where $\mathcal{X}$ is the pattern space, which is usually a subspace of $\mathbb{R}^m$ with $m$ being the
feature dimensionality, and $\mathcal{Y}$ is the label space. In the dissertation we mainly work
on a binary classification problem where $\mathcal{Y} = \{\pm 1\}$. In Chapter 5, we will extend our
discussions to multiclass problems. Given a set of training samples $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$,
which are independent and identically distributed (i.i.d.) samples from an unknown
but fixed distribution $p(\mathbf{x}, y)$, and a family of hypothesis functions $\mathcal{F}$, our goal is to
find a function $f \in \mathcal{F}$ such that the generalization error

$$R(f) = \int L(f(\mathbf{x}), y)\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy \qquad (2.1)$$

is minimized, where $L(f(\mathbf{x}), y)$ is a loss function reflective of the consequences of
deciding $f(\mathbf{x})$ when the true label is $y$. In a regression problem, the square loss
$L(f(\mathbf{x}), y) := (f(\mathbf{x}) - y)^2$ is commonly used. In a classification problem, one often
uses the 0/1 loss: $L(f(\mathbf{x}), y) = I(f(\mathbf{x}) \neq y)$, where $I(A)$ is an indicator function which
is equal to 1 if $A$ holds and 0 otherwise. It is easy to show that under the 0/1 loss
$R(f)$ is the classification error of $f(\mathbf{x})$. Since the underlying distribution $p(\mathbf{x}, y)$ is
unknown and we only have a limited number of training data at our disposal, we
cannot determine $f(\mathbf{x})$ by directly evaluating Equation (2.1). One possible remedy
is to find $f(\mathbf{x})$ by minimizing the empirical error

$$R_{emp}(f) = \frac{1}{N} \sum_{n=1}^{N} L(f(\mathbf{x}_n), y_n). \qquad (2.2)$$

Unfortunately, minimizing the empirical error cannot guarantee finding a function
that performs well on unseen samples. If $\mathcal{F}$ is too small, we are subject to underfitting
since $\mathcal{F}$ is unlikely to contain the good decision rules that reflect the structures of
the training data; on the other hand, if $\mathcal{F}$ is too big, the function found through the
evaluation of Equation (2.2) can deviate drastically from the optimal one. This is the
so-called overfitting problem, which is closely related to the curse of dimensionality
in statistical pattern recognition and model selection in regression problems. We can
simply augment the hypothesis set $\mathcal{F}$ (using complicated decision rules) to deal with
the underfitting problem. However, it is usually difficult to answer the following question:
how big is big enough for $\mathcal{F}$ for a specific application? Hence one usually faces
the problem of overfitting. Due to the overfitting concern as well as the implementation
difficulty (under the 0/1 loss, $R_{emp}(f)$ is neither convex nor differentiable), nearly all
existing classification algorithms do not explicitly try to minimize the
empirical error. For example, the support vector machine (SVM) [78], linear discriminant
analysis (LDA) [30] and the one-layer neural network (NNW) [5], all pursuing a linear
classifier, have different strategies for finding $f$ from $\mathcal{F}$: SVM maximizes the margin
of a classifier, LDA maximizes the generalized Rayleigh quotient, and the one-layer
NNW minimizes the mean square error. Given infinite training data, they all lead
(at least approximately) to the optimal Bayesian rule. However, with a limited number of
training data, different strategies can lead to different generalization performances. In
this sense, given a family of hypothesis functions, a learning problem can be restated
as finding a strategy to minimize training errors while limiting overfitting.
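
As a small illustration of Equation (2.2), the sketch below evaluates the empirical error of a fixed decision rule under the 0/1 loss and the square loss. The function names and the toy rule are assumptions made for this example only.

```python
import numpy as np

def zero_one_loss(pred, y):
    """0/1 loss: 1 where the prediction differs from the label, 0 otherwise."""
    return (pred != y).astype(float)

def square_loss(pred, y):
    """Square loss (f(x) - y)^2."""
    return (pred - y) ** 2.0

def empirical_error(f, X, y, loss=zero_one_loss):
    """Average loss of the rule f over the N training samples, as in Equation (2.2)."""
    return float(np.mean(loss(f(X), y)))
```

Because the 0/1 loss is piecewise constant in the parameters of `f`, this quantity gives no gradient information, which is one reason practical algorithms optimize a surrogate instead.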

In the following, we summarize some notations we will use throughout the
dissertation. Refer to Appendix A for more detailed descriptions. We use a bold
lowercase letter to denote a vector and a bold uppercase letter to denote a matrix. The
$(i,j)$-th entry of a matrix $\mathbf{Z}$ is written as $z_{ij}$. $\mathbf{z}_{\cdot j}$ and $\mathbf{z}_{i \cdot}$ are the $j$-th column and $i$-th row of
$\mathbf{Z}$, respectively. We also use $\tilde{\mathbf{a}}$ to denote the unnormalized vector of $\mathbf{a}$, i.e., $\mathbf{a} = \tilde{\mathbf{a}} / \|\tilde{\mathbf{a}}\|_1$,
where $\|\cdot\|_p$ is the $p$-norm. The margin of the sample $\mathbf{x}_n$ with respect to the classifier
$f$ is defined as $\rho(\mathbf{x}_n) = y_n f(\mathbf{x}_n)$. Note that if $f : \mathbf{x} \to [-1, +1]$, $\rho(\mathbf{x}_n)$ measures the
confidence of the correctness of the classification. The margin of the classifier $f$ (or
simply margin) is defined as $\rho = \min_{1 \le n \le N} \rho(\mathbf{x}_n)$.
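
The two margin definitions can be sketched as follows, assuming the classifier outputs $f(\mathbf{x}_n)$ are already available as an array with values in $[-1, +1]$; the function names are illustrative.

```python
import numpy as np

def sample_margins(f_values, y):
    """Margin of each sample: y_n * f(x_n). Negative values are misclassifications."""
    return np.asarray(y) * np.asarray(f_values)

def classifier_margin(f_values, y):
    """Margin of the classifier: the smallest sample margin over the training set."""
    return float(np.min(sample_margins(f_values, y)))
```

A classifier margin greater than zero means every training sample is classified correctly, and a larger value means the least confident sample is still classified with some slack.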

2.2  Ensemble  Learning 

Within  the  last  decade,  large  margin  classification  techniques  have  emerged 
as  a practical  result  of  the  theory  of  generalization.  Two  typical  examples  of  large 
margin classifiers are the support vector machine and AdaBoost. These are two of the most
active research areas in the machine learning community. In this work, we mainly
focus  on  the  AdaBoost  algorithm.  We  start  describing  AdaBoost  from  a more  general 
case:  ensemble  learning. 

2.2.1  Ensemble  Learning 

The basic idea of ensemble learning is to combine a set of simple “rules”
to form an ensemble whose performance improves on that of any single ensemble
member. Mathematically, it can be formulated as: given a class of hypothesis
functions $\mathcal{H} = \{h_t : \mathbf{x} \to \{\pm 1\},\ t = 1, \cdots, T\}$, called weak learners or base learners,
we construct an ensemble function $F(\mathbf{x})$ as:

$$F(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x}) \quad \text{and} \quad f(\mathbf{x}) = \frac{F(\mathbf{x})}{\sum_{t=1}^{T} \alpha_t} \qquad (2.3)$$

such that a certain cost function $L(f(\mathbf{x}), y)$ is optimized. Both the combination
coefficients $\boldsymbol{\alpha}$ and the hypothesis functions $h_t(\mathbf{x})$ are learned in the learning process.
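
Equation (2.3) amounts to a weighted vote followed by a normalization. The sketch below assumes the base-learner outputs are stored in a $T \times N$ array and that the coefficients are nonnegative with a positive sum; `ensemble_outputs` is a hypothetical helper name.

```python
import numpy as np

def ensemble_outputs(alphas, H):
    """Return F(x) and the normalized f(x) of Equation (2.3).

    alphas: (T,) nonnegative combination coefficients (assumed to sum to > 0).
    H: (T, N) outputs h_t(x_n) of the T base learners, each in {-1, +1}.
    """
    alphas = np.asarray(alphas, dtype=float)
    F = alphas @ np.asarray(H, dtype=float)   # weighted linear vote
    f = F / alphas.sum()                      # normalization keeps f in [-1, +1]
    return F, f
```

The normalization does not change the predicted label, since the sign of `f` equals the sign of `F`, but it makes margins comparable across ensembles of different sizes.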

Ensemble learning has deep roots in engineering and the sciences. Ensemble
models can be distinguished based on the properties of the basis functions
used to construct ensemble functions and the cost function to be optimized. For
example, basis functions can be complete and orthogonal, as in the well-known
Fourier Transform and Discrete Wavelet Transform [58]. For these models, the combination
coefficients can be found simply by inner products. In other ensemble
models, such as matching pursuit [44], the basis functions are usually taken from
an over-complete and non-orthogonal dictionary, and a stage-wise forward procedure
is usually employed to find both basis functions and combination coefficients. All
of the aforementioned models are mainly for signal representation, and therefore a mean
square error cost function is optimized. In machine learning, (2.3) is called “leveraging”
or “boosting”, whose goal is to accurately assign labels to patterns. Therefore,
hypothesis functions are used as basis functions and the empirical error or an upper
bound on it is used as the cost function. For example, the well-known AdaBoost minimizes
an exponential loss, $\exp(-yF(\mathbf{x}))$, and the LogitBoost algorithm [29] minimizes the
logistic loss, $\log_2(1 + \exp(-yF(\mathbf{x})))$.
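
The sense in which these two losses serve as upper bounds on the empirical error can be checked numerically. The sketch below writes each loss as a function of $z = yF(\mathbf{x})$ and assumes the convention that $z = 0$ counts as an error for the 0/1 loss; the function names are illustrative.

```python
import numpy as np

def zero_one(z):
    """0/1 loss as a function of z = y * F(x); z == 0 is counted as an error (assumed convention)."""
    return (np.asarray(z) <= 0).astype(float)

def exp_loss(z):
    """Exponential loss minimized by AdaBoost."""
    return np.exp(-np.asarray(z, dtype=float))

def logistic_loss(z):
    """Logistic loss (base 2) minimized by LogitBoost."""
    return np.log2(1.0 + np.exp(-np.asarray(z, dtype=float)))
```

Both surrogates equal 1 at $z = 0$ and stay above the 0/1 loss everywhere, but the exponential loss grows much faster for large negative $z$, which is consistent with the sensitivity of AdaBoost to mislabelled samples discussed in Chapter 1.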

2.2.2  Why  Ensemble  Learning? 

In the past decades, one of the main research areas in machine learning has been
the construction of ensembles of learning machines. Empirical studies
showed that ensemble learning can effectively boost the performance of the individual
base learners. One simple yet intuitive explanation of the effectiveness of ensemble
learning is given by Dietterich [20]: suppose that we have a binary classification problem
and T hypothesis functions whose errors are lower than 0.5; then the resulting
majority  voting  ensemble  has  an  error  lower  than  the  single  classifier,  as  long  as  the 
errors  of  the  base  learners  are  uncorrelated.  In  fact,  if  we  have  20  classifiers  and  the 
error  rate  of  each  base  learner  are  all  equal  to  p — 0.3  and  the  errors  are  independent, 
the  overall  error  of  the  majority  voting  ensemble  will  be: 


A strong assumption in the above example is that the outcomes of the base learners
are uncorrelated. If the independence assumption does not hold, we cannot be sure
that ensemble learning reduces errors. This also implies that the effectiveness of
ensemble learning depends on both the diversity and the accuracy of the base learners.
In real applications, correlations always exist, so the error incurred by an ensemble
learner cannot be calculated as simply as above. Nevertheless, the example gives
an intuitive explanation of why ensemble learning works.
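The majority-vote argument can be checked with a few lines of code. This is a minimal sketch under the independence assumption stated above; the function name is illustrative, and treating the ensemble as wrong only when a strict majority of classifiers errs is an assumption about how ties are handled.

```python
from math import comb

def majority_vote_error(num_classifiers, p):
    """Probability that a strict majority of independent classifiers is wrong."""
    threshold = num_classifiers // 2 + 1  # smallest number of wrong votes that loses
    return sum(
        comb(num_classifiers, k) * p**k * (1 - p) ** (num_classifiers - k)
        for k in range(threshold, num_classifiers + 1)
    )

# 20 independent classifiers, each with error rate 0.3.
print(majority_vote_error(20, 0.3))
```

With 20 classifiers of individual error 0.3, the ensemble error is far below 0.3, and it shrinks further as more independent classifiers are added.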

Another reason for using ensemble learning arises from the limited representation
capability of learning algorithms. In many cases the unknown function to be
estimated is not present in the hypothesis space $\mathcal{H}$, but a combination of hypotheses
drawn from $\mathcal{H}$ can expand the space of representable functions so that it also includes
the true one.

The need for ensemble learning can also be explained by the classic bias-variance
analysis. It has been shown that many ensemble methods can reduce variance, or
both variance and bias. Finally, two recently developed ensemble learning algorithms,
AdaBoost and Arc-GV [9], which learn the combination coefficients and hypothesis
functions adaptively, can be shown to effectively increase the margin of the resulting
classifier, leading to an improved generalization capability.

2.3  AdaBoost 

2.3.1  AdaBoost 

In the past several years, several ensemble methods [27], [8], [9] have been developed.
One simple yet very effective algorithm is the Bagging (bootstrap aggregation)
algorithm [8]. It trains each base learner on a set of training samples generated
by drawing randomly from the original training set with replacement. The combination
coefficient $\alpha_t$ is simply set to $1/T$, where $T$ is the total number of iterations.
Although simple, Bagging has been shown to effectively improve the performance of
“unstable” learning algorithms, where small changes in the training set result in large
changes in predictions. Breiman [8] claimed that neural networks and decision trees
are examples of unstable learning algorithms.

A more sophisticated ensemble method is AdaBoost [69], [68], [27]. AdaBoost
is generally considered a first step towards the development of more practical
boosting algorithms. The pseudocode of AdaBoost is presented as follows:

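AdaBoost

Initialization: $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, maximum iteration number $T$, $d^{(1)}(n) = 1/N$.

for $t = 1 : T$

1.  Train the weak learner with respect to the distribution $d^{(t)}$ and get the
hypothesis $h_t(\mathbf{x}) : \mathbf{x} \to \{\pm 1\}$.

2.  Calculate the weighted training error $\epsilon_t$ of $h_t$:

$$\epsilon_t = \sum_{n=1}^{N} d^{(t)}(n)\, I(y_n \neq h_t(\mathbf{x}_n))$$

where $I(\cdot)$ is the indicator function.

3.  Compute the combination coefficient:

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} \tag{2.6}$$

4.  Update weights:

$$d^{(t+1)}(n) = d^{(t)}(n) \exp(-\alpha_t y_n h_t(\mathbf{x}_n)) / C_t \tag{2.7}$$

where $C_t$ is the normalization constant such that $\sum_{n=1}^{N} d^{(t+1)}(n) = 1$.

end

Output: $F(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})$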


The main idea of AdaBoost is to repeatedly apply the base learning algorithm
to re-sampled versions of the training data to produce a collection of hypothesis
functions, which are finally combined via a weighted linear vote to form the final
decision (Figure 2.1). An intuitive idea in AdaBoost is that the examples which are
misclassified get more weight in the following iterations (Equation (2.7)), and hence
the subsequent classifiers focus more and more on the harder cases, for instance,
the samples near the decision boundary. AdaBoost initially sets the probability of
picking each sample to $1/N$. In each iteration, a base learner is trained with
respect to the distribution $d^{(t)}$ to minimize the weighted training error $\epsilon_t$. Then $\alpha_t$
is calculated based on the performance of the base learner. For example, if $\epsilon_t = 1/2$,
i.e., $h_t$ contributes no information to the final ensemble, then $\alpha_t = 0$.
In AdaBoost, we assume that $\mathcal{H}$ is closed under negation, that is, $h \in \mathcal{H}$ implies
$-h \in \mathcal{H}$. This implies $\epsilon_t \le 1/2$, since otherwise we can simply use $-h$ instead of $h$.
After $\alpha_t$ is computed, the distribution is updated. Note that if $h_t(\mathbf{x}_n)$ agrees with
$y_n$, the weight is decreased; otherwise the weight is increased.

Figure 2.1: AdaBoost repeatedly applies the base learning algorithm to the re-sampled
versions of the training data to produce a collection of hypothesis functions
which are finally combined via a weighted linear vote to form the final decision.

Figure 2.2: The exponential loss used in AdaBoost is an upper bound of the 0/1 loss.

The original AdaBoost algorithm uses binary-valued functions as base learners.
Schapire and Singer [69] extended the algorithm to a more general version where
real-valued hypothesis functions are used, that is, $h_t(\mathbf{x}) : \mathbf{x} \to [-1, 1]$, with
$\mathrm{sgn}(h_t(\mathbf{x}))$ being the class label and $|h_t(\mathbf{x})|$ the classification confidence. In this
work, unless otherwise noted, we use binary-valued base learners. However, all of the
analyses can readily be carried over to the case of real-valued base learners.
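The AdaBoost procedure of this section can be sketched in a few lines of code. The sketch below is only an illustration, not the implementation used in this work: the weak learner is assumed to be a threshold stump on one-dimensional inputs, the names `train_stump` and `adaboost` are hypothetical, and stopping when the weighted error reaches 0 or 1/2 is an added assumption.

```python
import math

def train_stump(xs, ys, weights):
    """Hypothetical weak learner: 1-D threshold stump with minimal weighted error."""
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best = None
    for thr in thresholds:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x > thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    err, thr, sign = best
    return (lambda x: sign if x > thr else -sign), err

def adaboost(xs, ys, num_rounds):
    n = len(xs)
    d = [1.0 / n] * n                        # initial distribution d(n) = 1/N
    ensemble = []                            # pairs (alpha_t, h_t)
    for _ in range(num_rounds):
        h, eps = train_stump(xs, ys, d)      # weak learner and its weighted error
        if eps <= 0.0:                       # perfect stump: keep it and stop (assumption)
            ensemble.append((1.0, h))
            break
        if eps >= 0.5:                       # uninformative stump: stop (assumption)
            break
        alpha = 0.5 * math.log((1.0 - eps) / eps)            # Eq. (2.6)
        d = [w * math.exp(-alpha * y * h(x)) for x, y, w in zip(xs, ys, d)]
        c = sum(d)                                           # normalization constant C_t
        d = [w / c for w in d]                               # Eq. (2.7)
        ensemble.append((alpha, h))
    return lambda x: sum(a * h(x) for a, h in ensemble)      # F(x)

# Toy data that no single stump can separate.
xs, ys = [1.0, 2.0, 3.0], [1, -1, 1]
F = adaboost(xs, ys, num_rounds=3)
print([1 if F(x) > 0 else -1 for x in xs])
```

On this toy set every single stump misclassifies at least one sample, while the weighted vote of three stumps classifies all three correctly, which illustrates how re-weighting shifts attention to the harder samples.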

2.3.2  AdaBoost  as  Functional  Gradient  Descent  Procedure 


At first glance, it is not clear what this algorithm is doing. In [47], an
interesting interpretation of AdaBoost was provided: the algorithm performs a
stage-wise gradient descent on the sample average of a cost function of the margin
distribution. In particular, the cost function is defined as:

$$E = \frac{1}{N} \sum_{n=1}^{N} \exp(-\rho(\mathbf{x}_n)) = \frac{1}{N} \sum_{n=1}^{N} \exp\Bigl(-y_n \sum_{t} \alpha_t h_t(\mathbf{x}_n)\Bigr) \tag{2.8}$$

where $\rho(\mathbf{x}_n) = y_n F(\mathbf{x}_n)$ is the margin of the sample $\mathbf{x}_n$ with respect to the ensemble
classifier $F(\mathbf{x}_n)$. Note that the cost function defined above is in effect an upper bound
of the empirical error $R_{\mathrm{emp}}$ (Figure 2.2).

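At the $t$-th iteration, define the negative functional derivative of $E$ at $F_{t-1}$ as:

$$-\nabla E(F_{t-1})(\mathbf{x}) = \begin{cases} 0 & \text{if } \mathbf{x} \neq \mathbf{x}_n \\ \frac{1}{N}\, y_n \exp(-y_n F_{t-1}(\mathbf{x}_n)) & \text{if } \mathbf{x} = \mathbf{x}_n,\ n = 1, \cdots, N \end{cases} \tag{2.9}$$

which is the direction in which the cost function most rapidly decreases. Since we
are restricted to choosing our new function $h_t$ from $\mathcal{H}$, we may not be able to choose
$h_t = -\nabla E$. Instead we search for an $h_t$ such that the inner product:

$$\begin{aligned} \langle -\nabla E, h_t \rangle &= \frac{1}{N} \sum_{n=1}^{N} \exp(-y_n F_{t-1}(\mathbf{x}_n))\, y_n h_t(\mathbf{x}_n) \\ &= \frac{\sum_{n=1}^{N} \exp(-y_n F_{t-1}(\mathbf{x}_n))}{N} \sum_{n=1}^{N} \frac{\exp(-y_n F_{t-1}(\mathbf{x}_n))}{\sum_{i=1}^{N} \exp(-y_i F_{t-1}(\mathbf{x}_i))}\, y_n h_t(\mathbf{x}_n) \end{aligned} \tag{2.10}$$

is maximized. By unraveling Equation (2.7), we get:

$$d^{(t)}(n) = d^{(t-1)}(n) \exp(-\alpha_{t-1} y_n h_{t-1}(\mathbf{x}_n)) / C_{t-1} = \frac{\exp(-y_n F_{t-1}(\mathbf{x}_n))}{\sum_{i=1}^{N} \exp(-y_i F_{t-1}(\mathbf{x}_i))} \tag{2.11}$$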


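Therefore,

$$\begin{aligned} h_t &= \arg\max_{h \in \mathcal{H}} \langle -\nabla E, h \rangle \\ &= \arg\max_{h \in \mathcal{H}} \sum_{n=1}^{N} d^{(t)}(n)\, y_n h(\mathbf{x}_n) \\ &= \arg\max_{h \in \mathcal{H}} \Bigl( \sum_{\{n :\, y_n = h(\mathbf{x}_n)\}} d^{(t)}(n) - \sum_{\{n :\, y_n \neq h(\mathbf{x}_n)\}} d^{(t)}(n) \Bigr) \\ &= \arg\max_{h \in \mathcal{H}} (1 - 2\epsilon_t) \\ &= \arg\min_{h \in \mathcal{H}} \epsilon_t \end{aligned} \tag{2.12}$$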


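This suggests that $h_t(\mathbf{x})$ be chosen to minimize the weighted error $\epsilon_t$. After $h_t(\mathbf{x})$ is
selected, the coefficient $\alpha_t$ can be found by a line search to minimize the intermediate
cost function:

$$E_t = \frac{1}{N} \sum_{n=1}^{N} \exp\bigl(-y_n (F_{t-1}(\mathbf{x}_n) + \alpha_t h_t(\mathbf{x}_n))\bigr) \tag{2.13}$$

and in the binary classification case, i.e., $\mathcal{H} = \{h(\mathbf{x}) : \mathbf{x} \to \pm 1\}$, $\alpha_t$ can be computed
analytically as the solution to $\partial E_t / \partial \alpha_t = 0$, which gives the closed form in Equation
(2.6).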
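The closed form can be verified with a short calculation. The steps below are a sketch in the notation of (2.11) and (2.13); they use only $y_n h_t(\mathbf{x}_n) \in \{\pm 1\}$ and the fact that the weights $d^{(t)}(n)$ are proportional to $\exp(-y_n F_{t-1}(\mathbf{x}_n))$.

```latex
\begin{align*}
E_t(\alpha_t)
  &= \frac{1}{N} \sum_{n=1}^{N} \exp\bigl(-y_n F_{t-1}(\mathbf{x}_n)\bigr)
     \exp\bigl(-\alpha_t y_n h_t(\mathbf{x}_n)\bigr) \\
  &\propto \sum_{n=1}^{N} d^{(t)}(n) \exp\bigl(-\alpha_t y_n h_t(\mathbf{x}_n)\bigr)
   = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}, \\
\frac{\partial E_t}{\partial \alpha_t} = 0
  &\iff \epsilon_t\, e^{\alpha_t} = (1 - \epsilon_t)\, e^{-\alpha_t}
   \iff \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}.
\end{align*}
```

Since $E_t$ is a positive combination of $e^{-\alpha_t}$ and $e^{\alpha_t}$, it is convex in $\alpha_t$, so this stationary point is the minimizer.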
Another similar interpretation of AdaBoost as a gradient descent method in
a hypothesis space $\mathcal{H}$, presented in [63], [10], is to view the update of the
distribution $d^{(t)}(n)$ as normalizing the gradient of $E_{t-1}$ with respect to $\rho(\mathbf{x}_n)$:

$$d^{(t)}(n) = \frac{\partial E_{t-1} / \partial \rho(\mathbf{x}_n)}{\sum_{j} \partial E_{t-1} / \partial \rho(\mathbf{x}_j)} \tag{2.14}$$

Equation (2.14) answers the question of which pattern should increase
its margin most strongly in order to decrease $E$ maximally.
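The normalized-gradient view of the sample weights can be illustrated numerically. This is a minimal sketch for the exponential cost $E = \frac{1}{N}\sum_n \exp(-\rho_n)$; the function name is illustrative and not part of the original algorithm description.

```python
import math

def weights_from_margins(margins):
    """Normalize dE/d(rho_n) for E = (1/N) * sum_n exp(-rho_n) into a distribution."""
    n = len(margins)
    grads = [-math.exp(-rho) / n for rho in margins]   # partial derivatives of E
    total = sum(grads)
    return [g / total for g in grads]                  # signs cancel in the ratio

# A misclassified sample (negative margin) receives the largest weight.
print(weights_from_margins([-1.0, 0.0, 2.0]))
```

The resulting weights coincide with $\exp(-\rho_n) / \sum_j \exp(-\rho_j)$, so samples with small or negative margins dominate the next round.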

2.3.3  Training  Error  Bound 


One of the most important theoretical properties of AdaBoost is its capability
to reduce the training error to zero exponentially fast in the number of iterations,
provided that the base learners achieve an error rate less than 0.5 in each
iteration. This is a mild assumption, since a random guess already has an error of 0.5.


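Theorem 2.3.1 [69] With AdaBoost described as above, the following bound holds
on the training error of the ensemble function $F(\mathbf{x})$:

$$\mathrm{Error} = \frac{1}{N} \bigl|\{n : \mathrm{sign}(F(\mathbf{x}_n)) \neq y_n\}\bigr| \le \prod_{t=1}^{T} C_t \tag{2.15}$$

Proof: Unraveling the distribution update rule (2.7) yields:

$$d^{(T+1)}(n) = \frac{\exp\bigl(-\sum_{t=1}^{T} \alpha_t y_n h_t(\mathbf{x}_n)\bigr)}{N \prod_{t=1}^{T} C_t} \tag{2.16}$$

Since $d^{(T+1)}$ is a distribution, $\sum_{n=1}^{N} d^{(T+1)}(n) = 1$. It follows that:

$$\prod_{t=1}^{T} C_t = \frac{1}{N} \sum_{n=1}^{N} \exp\Bigl(-\sum_{t=1}^{T} \alpha_t y_n h_t(\mathbf{x}_n)\Bigr) = \frac{1}{N} \sum_{n=1}^{N} \exp(-y_n F(\mathbf{x}_n)) \tag{2.17}$$

Note that if an error occurs, i.e., $\mathrm{sign}(F(\mathbf{x}_n)) \neq y_n$, then $y_n F(\mathbf{x}_n) \le 0$ and
$I(y_n F(\mathbf{x}_n) \le 0) \le \exp(-y_n F(\mathbf{x}_n))$, which implies that:

$$\begin{aligned} \frac{1}{N} \bigl|\{n : \mathrm{sign}(F(\mathbf{x}_n)) \neq y_n\}\bigr| &= \frac{1}{N} \sum_{n=1}^{N} I(y_n F(\mathbf{x}_n) \le 0) \\ &\le \frac{1}{N} \sum_{n=1}^{N} \exp(-y_n F(\mathbf{x}_n)) \\ &= \prod_{t=1}^{T} C_t \end{aligned} \tag{2.18}$$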


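From Equation (2.7), summing over the data sample index $n$, we get:

$$C_t = \sum_{n=1}^{N} d^{(t)}(n) \exp(-\alpha_t y_n h_t(\mathbf{x}_n)) \tag{2.19}$$

Plugging (2.6) into (2.19), we get:

$$\begin{aligned} C_t &= \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} \sum_{\{n :\, y_n \neq h_t(\mathbf{x}_n)\}} d^{(t)}(n) + \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} \sum_{\{n :\, y_n = h_t(\mathbf{x}_n)\}} d^{(t)}(n) \\ &= 2\sqrt{\epsilon_t (1 - \epsilon_t)} \end{aligned} \tag{2.20}$$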

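Substituting $C_t$ into (2.15), we see that the training error of AdaBoost is at most
$2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}$. Assuming further that in each iteration the error rate of $h_t$
satisfies $\epsilon_t \le \frac{1}{2} - \frac{1}{2}\gamma$ with $\gamma \in [0, 1]$, the training error is bounded as:

$$\mathrm{Error} \le \prod_{t=1}^{T} \sqrt{1 - \gamma^2} \le \exp\Bigl(-\frac{T \gamma^2}{2}\Bigr) \tag{2.21}$$

The last inequality is due to the fact that $\sqrt{1 - \gamma^2} \le \exp(-\gamma^2 / 2)$ if $\gamma \in [0, 1]$. Since
the training error is a multiple of $1/N$, it must be zero as soon as the bound falls
below $1/N$. Therefore, for $\gamma > 0$, it takes at most $\lfloor 2 \ln N / \gamma^2 \rfloor + 1$ iterations for
AdaBoost to find an ensemble classifier achieving zero training error.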
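The bound can be evaluated numerically for given weighted errors. The following is a minimal sketch; both function names are illustrative, and the iteration count is the one obtained by requiring the exponential bound $\exp(-T\gamma^2/2)$ to fall below $1/N$, which is a derivation assumed here rather than a quoted result.

```python
import math

def training_error_bound(errors):
    """Product of C_t = 2*sqrt(eps_t*(1 - eps_t)) over the given weighted errors."""
    bound = 1.0
    for eps in errors:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound

def rounds_for_zero_training_error(num_samples, gamma):
    """Smallest T with exp(-T*gamma**2/2) < 1/N, which forces zero training errors."""
    return math.floor(2.0 * math.log(num_samples) / gamma**2) + 1

# Ten rounds with weighted error 0.3 each, i.e. gamma = 0.4.
print(training_error_bound([0.3] * 10), math.exp(-10 * 0.4**2 / 2))
print(rounds_for_zero_training_error(100, 0.4))
```

The first printed value (the product bound) is never larger than the second (the exponential bound), and the required number of rounds grows only logarithmically with the number of samples.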

2.3.4  Generalization  Error  Bound 

Driving the training error to zero does not guarantee that the final classifier
generalizes well to unseen test samples. On the contrary, one may suspect that
AdaBoost quickly leads to overfitting. However, there is a growing body of empirical
evidence suggesting that AdaBoost can also effectively reduce the generalization
error, and in many cases the generalization error continues to decrease even after
the training error reaches zero. Much work, both experimental and theoretical,
has been done to understand the impressive generalization capability of AdaBoost
[31], [47], [38], [39], [37], [70]. One of the leading explanations is the margin theory [70],
which states that AdaBoost is particularly aggressive at maximizing the margin of
the resulting classifier and that a larger margin on the training samples leads to a
better upper bound on the generalization error. The main result is summarized in the
following theorem:

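Theorem 2.3.2 [70] Let $P$ be a distribution over $X \times \{\pm 1\}$, and let $\mathcal{D}$ be a sample
of $N$ examples chosen independently at random according to $P$. Assume that the base
learner space $\mathcal{H}$ is finite, and let $\delta > 0$. Then with probability at least $1 - \delta$ over the
random choice of the training set $\mathcal{D}$, every weighted average function $f$ satisfies the
following bound for all $\theta > 0$:

$$\Pr_{P}\bigl(y f(\mathbf{x}) \le 0\bigr) \le \Pr_{\mathcal{D}}\bigl(y f(\mathbf{x}) \le \theta\bigr) + O\Biggl(\frac{1}{\sqrt{N}} \Bigl(\frac{\log N \log |\mathcal{H}|}{\theta^2} + \log(1/\delta)\Bigr)^{1/2}\Biggr) \tag{2.22}$$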

In the early days of AdaBoost development, some researchers conjectured
that AdaBoost is immune to overfitting, and some theories were developed to
support this claim. Several large-scale experiments [21], [4], [60] based on artificial and
real-world datasets have since been conducted to compare AdaBoost with other ensemble
algorithms such as Bagging and to investigate the behavior of AdaBoost.
It was found that the performance of AdaBoost can degrade drastically in the
presence of noise. Here, by noise we refer to the following cases:
(i) the distributions of different classes overlap; (ii) some data samples are
mislabelled. The latter is common in some applications; e.g., in breast cancer detection,
some benign samples could be labelled as cancer samples and vice versa due to segmentation
errors and/or human misinterpretation. The main reason AdaBoost performs
so poorly on noisy data is that it devotes too much of its resources to a few
harder samples and classifies them indiscriminately according to their associated
labels. Most of the work in this dissertation is devoted to devising AdaBoost algorithms
that are robust against noise. We detail our discussion in the following sections.

2.3.5  Connection  Between  AdaBoost  and  Minimax  Problem 

The connection between the well-known minimax problem [79] and the AdaBoost
algorithm was first noted in [10], [28]. In fact, the problem of maximizing the
margin of a classifier can be solved exactly by using linear programming (LP):

$$\begin{array}{ll} \max_{(\rho, \boldsymbol{\alpha})} & \rho \\ \text{subject to} & \rho(\mathbf{x}_n) \ge \rho, \quad n = 1, \cdots, N \\ & \sum_{t} \alpha_t = 1, \ \boldsymbol{\alpha} \ge 0 \end{array} \tag{2.23}$$

This observation motivated Grove et al. [31] to derive an LP-based boosting algorithm,
referred to as LP-AdaBoost, that directly maximizes the minimum margin of an
ensemble of the existing hypothesis functions. Several experiments with LP-AdaBoost
on UCI benchmarks were made, and it was found that LP-AdaBoost, although capable
of achieving a large minimum margin, almost always yields a worse performance
than AdaBoost. This result is as expected. As we will show in the following chapter,
LP-AdaBoost leads to a classification scheme that tries to optimize the performance
in the worst case. In the noisy data case, where the data distributions are highly
overlapped and some data samples are even mislabelled, LP-AdaBoost can be easily
misled by a few outlier samples, resulting in a suboptimal performance. In [65],
AdaBoost is regarded as a barrier method, which may explain why LP-AdaBoost
almost always yields a worse performance than AdaBoost. Grove et al. [31] also
investigated the asymptotic behavior of AdaBoost, which is usually
overlooked by other researchers. It was found that after a large number of iterations,
although the minimum margin of a classifier continues to increase, the testing
performance starts to deteriorate. More interestingly, the margin produced by AdaBoost
converges to the optimum solution of (2.23). For this reason, they described AdaBoost
as a slow LP solver. This claim, though it still lacks a strict proof, is widely accepted by
many researchers. It suggested a new direction: devising ensemble classifiers
directly in the mathematical optimization setting by resorting to well-established
optimization techniques and carrying new ideas into the boosting setting to obtain
new AdaBoost-like algorithms [31], [62], [63], [64], [19]. This is also the starting point
of this dissertation.


CHAPTER  3 

REGULARIZED  ADABOOST 

3.1  Introduction 

It has been shown that in the low-noise regime AdaBoost rarely suffers from
overfitting. However, recent studies [31], [62], [63] with highly noisy patterns
clearly showed that overfitting can occur, and several regularized boosting algorithms
[63], [26], [64] have been proposed to extend the applicability of AdaBoost to noisy
data. AdaBoost$_{\mathrm{Reg}}$ [63] is one of the first boosting-like algorithms that achieved
state-of-the-art generalization results on noisy data. It implements the intuitive idea
of controlling the tradeoff between the margin and the sample influences to achieve
a soft margin. In the experimental evaluations of [63], it was found to be among the best
performing algorithms. However, the problem with AdaBoost$_{\mathrm{Reg}}$ is that it is difficult to
analyze the underlying optimization scheme, since the modification is done at the
algorithm level [49], [62].

In this chapter, we study regularized AdaBoost algorithms from the viewpoint
of mathematical programming and thereby propose two regularization schemes
to improve the robustness of AdaBoost against noisy data. By studying the
connection between the minimax optimization problem and AdaBoost, we show that
the good generalization performance of AdaBoost can be explained by the fact that
the classification performance in the worst case is optimized. This also explains
why, in noisy data cases, AdaBoost will eventually overfit, since the data
samples can be highly overlapped and even mislabelled. Therefore, some form of
regularization is mandatory. A natural regularization strategy is to use the concept
of a soft margin, i.e., the algorithm does not attempt to classify all of the training
samples according to their labels but allows for some errors. One typical example
is LPreg-AdaBoost, which is the underlying optimization scheme supporting the $\nu$-Arc
[64] and C-Barrier [62] algorithms. LPreg-AdaBoost introduces slack variables into
an optimization problem in the primal domain, in the same way as the Support Vector
Machine (SVM) does for the non-separable data case, to achieve a soft margin
[15]. In the dual domain, it is equivalent to constraining the distributions to a box,
which can be understood as adding a penalty of 0 within the box and $\infty$ outside the
box. Therefore, this scheme is somewhat heuristic and may be too restrictive [48]. In
this chapter, we instead consider controlling the distribution skewness by adding a
convex penalty function to the objective function in a minimax problem formulation,
which leads to a piecewise convex optimization problem. Through a linear
approximation, this problem can be approximated as a linear programming problem in both
the dual and primal domains. Based on different penalty functions, two new soft
margin concepts are defined and two new regularized AdaBoost algorithms, referred
to as AdaBoost$_{\mathrm{KL}}$ and AdaBoost$_{\mathrm{norm2}}$, are proposed. These two algorithms can
be considered as extensions of AdaBoost$_{\mathrm{Reg}}$ for pursuing a soft margin. However,
the proposed algorithms have the following merits: (i) The underlying optimization
schemes are clear. Both algorithms can be viewed as performing a stage-wise
gradient descent procedure to minimize a cost function of the soft margin distributions;
(ii) Both algorithms can achieve better performance than AdaBoost$_{\mathrm{Reg}}$. In
particular, the performance of AdaBoost$_{\mathrm{KL}}$ is the best among the regularized AdaBoost
algorithms we have tested in this chapter.

We also test our algorithms on the benchmark USPS (US Postal Service)
database of handwritten digits. Since AdaBoost$_{\mathrm{norm2}}$ and AdaBoost$_{\mathrm{KL}}$ have
similar performance on the UCI data, in this experiment we focus only on AdaBoost$_{\mathrm{KL}}$.
We artificially add feature noise and change class labels. Through a set of
carefully designed experiments, we show a remarkable classification improvement of
AdaBoost$_{\mathrm{KL}}$ over the original AdaBoost algorithm.

The rest of the chapter is organized as follows. In Section 3.2 we study the
connection between the minimax optimization problem and AdaBoost. Based on
these discussions, two new algorithms, namely AdaBoost$_{\mathrm{KL}}$ and AdaBoost$_{\mathrm{norm2}}$,
are proposed. To demonstrate the effectiveness of the proposed algorithms, a
large-scale experiment based on several artificial and real-world datasets is
performed in Section 3.4. The results are compared with those of the AdaBoost$_{\mathrm{Reg}}$,
$\nu$-Arc, C-Barrier and SVM algorithms. We finally conclude the chapter in Section 3.5
with our remarks.


3.2  Regularized  AdaBoost 


We start with the minimax problem. The connection between the well-known
minimax problem [79] and the AdaBoost algorithm was first noted in [10], [28] and
was used to determine the maximum margin that one can achieve given a hypothesis
class by exploiting the dual relationship in linear programming. For the sake of
simplicity, for the moment we assume that the cardinality of the hypothesis function
set is finite and equal to $T$. Define a gain matrix $Z$:

$$Z = \begin{bmatrix} z_{11} & \cdots & z_{1T} \\ \vdots & \ddots & \vdots \\ z_{N1} & \cdots & z_{NT} \end{bmatrix}$$

where $z_{ij} = y_i h_j(\mathbf{x}_i)$ is the margin of the sample $\mathbf{x}_i$ with respect to the hypothesis
function $h_j$.

Now let us look at the following minimax optimization problem:

$$\max_{\boldsymbol{\alpha} \in \Gamma^T} \min_{\mathbf{d} \in \Gamma^N} \mathbf{d}^{\top} Z \boldsymbol{\alpha} \tag{3.1}$$

where $\Gamma^T$ is the distribution simplex defined as $\Gamma^T = \{\boldsymbol{\alpha} : \boldsymbol{\alpha} \in \mathbb{R}^T, \sum_{t=1}^{T} \alpha_t = 1, \boldsymbol{\alpha} \ge 0\}$.
The optimization scheme can be simply understood as finding a set of
combination coefficients $\boldsymbol{\alpha} \in \Gamma^T$ such that the performance of the ensemble classifier
in the worst case is maximized. It is easy to show that this classification scheme
leads to the maximum margin scheme in (2.23):

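$$\begin{aligned} \max_{\boldsymbol{\alpha} \in \Gamma^T} \min_{\mathbf{d} \in \Gamma^N} \mathbf{d}^{\top} Z \boldsymbol{\alpha} &= \max_{\boldsymbol{\alpha} \in \Gamma^T} \min_{1 \le n \le N} \sum_{t=1}^{T} \alpha_t z_{nt} \\ &= \max_{\boldsymbol{\alpha} \in \Gamma^T} \rho \quad \text{subject to } \sum_{t=1}^{T} \alpha_t z_{nt} \ge \rho, \ n = 1, \cdots, N \end{aligned} \tag{3.2}$$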
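The first equality in (3.2) says that the inner minimization over the simplex is attained by putting all mass on a single sample, so the worst case is the smallest ensemble margin. The following minimal sketch checks this on a toy gain matrix; the matrix values, the coefficients and the function names are illustrative assumptions.

```python
import random

def ensemble_margins(Z, alpha):
    """Margins (Z alpha)_n = sum_t alpha_t * z_nt for every sample n."""
    return [sum(a * z for a, z in zip(alpha, row)) for row in Z]

def worst_case_margin(Z, alpha):
    """min over the simplex of d^T Z alpha, attained at a vertex (one sample)."""
    return min(ensemble_margins(Z, alpha))

Z = [[1, 1, -1], [1, -1, 1], [-1, 1, 1]]     # toy values of z_nt = y_n * h_t(x_n)
alpha = [0.5, 0.3, 0.2]
rho = worst_case_margin(Z, alpha)

rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() for _ in Z]
    d = [r / sum(raw) for r in raw]          # a random distribution on the samples
    value = sum(dn * m for dn, m in zip(d, ensemble_margins(Z, alpha)))
    assert value >= rho - 1e-12              # no distribution does worse than rho
print(rho)
```

No randomly drawn distribution yields a value below the minimum margin, which is what allows the minimax problem to be rewritten as the margin-maximization LP.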

In  the  separable  case,  a large  margin  is  usually  conducive  to  good  generalization  in 
the  sense  that  if  a large  margin  can  be  achieved  with  respect  to  the  data,  an  upper 
bound  on  the  generalization  error  is  small.  However,  in  the  noisy  data  case  where  the 
data  distribution  is  highly  overlapped  and  some  data  samples  can  even  be  mislabelled, 
the  maximum  margin  scheme  can  be  easily  misled  by  the  outlier  data.  Consequently 
it  will  lead  to  a classifier  with  a suboptimal  performance.  Note  that  in  the  minimax 
problem  (3.1),  the  minimization  takes  place  over  the  entire  probability  space,  which 
is  not  sufficiently  restrictive.  A natural  strategy  is  to  constrain  the  distribution  or 
add  a penalty  term  to  the  cost  function  to  control  the  distribution  skewness  so  that 
the  algorithm  is  not  allowed  to  use  all  of  its  resources  to  deal  with  several  hard-to- 
learn  samples.  In  the  following  subsection,  we  present  three  regularized  AdaBoost 
algorithms  that  fall  within  this  framework. 

3.2.1  LPreg-AdaBoost 

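By constraining the distribution to a box $0 \le \mathbf{d} \le \mathbf{c}$, we get the following
optimization problem:

$$\max_{\boldsymbol{\alpha} \in \Gamma^T} \min_{\{\mathbf{d} \in \Gamma^N,\ \mathbf{d} \le \mathbf{c}\}} \mathbf{d}^{\top} Z \boldsymbol{\alpha} \tag{3.3}$$

where $\mathbf{c}$ is a constant vector that usually takes the form $\mathbf{c} = C\mathbf{1}$, with $C$ being a
predefined parameter and $\mathbf{1} \in \mathbb{R}^N$ being a vector of all ones. The meaning
of (3.3) can be understood as finding a set of combination coefficients $\boldsymbol{\alpha}$ such
that the classification performance in the worst case within the distribution box is
maximized. The LP equivalent to (3.3) is:

$$\begin{array}{ll} \max_{(\rho, \boldsymbol{\lambda}, \boldsymbol{\alpha})} & \rho - \sum_{n=1}^{N} c_n \lambda_n \\ \text{subject to} & \sum_{t=1}^{T} \alpha_t z_{nt} \ge \rho - \lambda_n, \quad n = 1, \cdots, N \\ & \lambda_n \ge 0, \quad n = 1, \cdots, N \\ & \boldsymbol{\alpha} \in \Gamma^T \end{array} \tag{3.4}$$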
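For fixed combination coefficients, the inner minimization of (3.3) over the box-constrained simplex can be solved greedily: the adversarial distribution places the maximum allowed mass $C$ on the samples with the smallest margins. The following minimal sketch assumes $c_n = C$ for all $n$ and $C \ge 1/N$; the function name is hypothetical.

```python
def box_worst_case(margins, C):
    """min of sum_n d_n * margins[n] over {d : sum_n d_n = 1, 0 <= d_n <= C}."""
    if C * len(margins) < 1.0:
        raise ValueError("box too small: need C >= 1/N for a feasible distribution")
    d = [0.0] * len(margins)
    remaining = 1.0
    # Fill the samples in order of increasing margin, at most C of mass each.
    for n in sorted(range(len(margins)), key=lambda i: margins[i]):
        d[n] = min(C, remaining)
        remaining -= d[n]
        if remaining <= 0.0:
            break
    value = sum(dn * m for dn, m in zip(d, margins))
    return value, d

margins = [-1.0, 0.5, 1.0, 1.0]     # toy ensemble margins, one outlier
for C in (1.0, 0.5, 0.25):
    print(C, box_worst_case(margins, C))
```

With $C = 1$ the worst case is dictated entirely by the outlier, as in the hard-margin scheme; shrinking $C$ spreads the mass over more samples, which is the soft-margin effect seen from the dual side.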

LPreg-AdaBoost [63] is a special case of (3.4) obtained by setting $c_1 = c_2 = \cdots = c_N = C$.
A similar scheme is also used in the Support Vector Machine [15] for nonseparable
data cases. (3.4) introduces a nonnegative slack variable $\lambda_n$ into the optimization
problem to achieve a soft margin for a pattern:

$$\rho_s(\mathbf{x}_n) = \rho(\mathbf{x}_n) + \lambda_n \tag{3.5}$$

The relaxation of the hard margin allows some patterns to have a margin smaller than
$\rho$, so the algorithm does not have to classify all of the patterns according to their
associated class labels. The dual of (3.4) can be derived as follows. The Lagrangian of
(3.4) is:


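$$\begin{aligned} L(\rho, \boldsymbol{\lambda}, \boldsymbol{\alpha}; \mathbf{d}, \boldsymbol{\xi}, \gamma, \boldsymbol{\theta}) = {} & \sum_{n=1}^{N} c_n \lambda_n - \rho + \sum_{n=1}^{N} d_n \Bigl(\rho - \lambda_n - \sum_{t=1}^{T} \alpha_t z_{nt}\Bigr) \\ & + \sum_{n=1}^{N} \xi_n (-\lambda_n) + \gamma \Bigl(\sum_{t=1}^{T} \alpha_t - 1\Bigr) + \sum_{t=1}^{T} \theta_t (-\alpha_t) \end{aligned} \tag{3.6}$$

where $\gamma$, $\mathbf{d} \ge 0$, $\boldsymbol{\xi} \ge 0$, $\boldsymbol{\theta} \ge 0$ are the Lagrange multipliers. The Lagrangian should
be minimized with respect to $\rho$, $\boldsymbol{\lambda}$, $\boldsymbol{\alpha}$ and maximized with respect to $\mathbf{d}$, $\boldsymbol{\xi}$, $\gamma$, $\boldsymbol{\theta}$. By
taking the derivatives of the Lagrangian with respect to $\rho$, $\boldsymbol{\lambda}$, $\boldsymbol{\alpha}$ and setting them to
zero, we get: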


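$$\frac{\partial L}{\partial \rho} = \sum_{n=1}^{N} d_n - 1 = 0 \ \Rightarrow \ \sum_{n=1}^{N} d_n = 1 \tag{3.7}$$

$$\frac{\partial L}{\partial \alpha_t} = \gamma - \sum_{n=1}^{N} d_n z_{nt} - \theta_t = 0 \ \Rightarrow \ \sum_{n=1}^{N} d_n z_{nt} = \gamma - \theta_t, \quad 1 \le t \le T \tag{3.8}$$

$$\frac{\partial L}{\partial \lambda_n} = c_n - d_n - \xi_n = 0 \ \Rightarrow \ d_n = c_n - \xi_n, \quad 1 \le n \le N \tag{3.9}$$

Plugging (3.7), (3.8) and (3.9) into (3.6), we get the dual form of (3.4):

$$\begin{array}{ll} \min_{(\gamma, \mathbf{d})} & \gamma \\ \text{subject to} & \sum_{n=1}^{N} d_n z_{nt} \le \gamma, \quad t = 1, \cdots, T \\ & \mathbf{d} \le \mathbf{c}, \ \mathbf{d} \in \Gamma^N \end{array} \tag{3.10}$$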


An important observation is the dual relationship between achieving a soft margin
and controlling the distribution skewness. Although the concept of overfitting in
the dual domain is not as straightforward as that in the primal domain, in terms of
pursuing a regularization scheme, working directly in the dual domain can be more
effective, since defining a soft margin is not straightforward in the primal domain
except for the one defined in (3.5).

For the convenience of the following discussions, we reformulate (3.3) slightly
as:

$$\max_{\boldsymbol{\alpha} \in \Gamma^T} \min_{\mathbf{d} \in \Gamma^N} \mathbf{d}^{\top} Z \boldsymbol{\alpha} + \beta(\|\mathbf{d}\|_{\infty}) \|\mathbf{d}\|_{\infty} \tag{3.11}$$

where $\| \cdot \|_p$ is the $p$-norm and $\beta(P)$ is a function defined by:

$$\beta(P) = \begin{cases} 0 & \text{if } P \le C \\ \infty & \text{if } P > C \end{cases} \tag{3.12}$$

Note that the box defined by $\{\mathbf{d} : \|\mathbf{d}\|_{\infty} \le C, \mathbf{d} \in \Gamma^N\}$ is centered on the
distribution center $\mathbf{d}_0 = [1/N, \cdots, 1/N]$ and the parameter $C$ reflects to some extent the
distribution skewness between the box boundary and $\mathbf{d}_0$. (3.11) shows that
LPreg-AdaBoost can be considered as a penalty scheme with a penalty of 0 within the box
and $\infty$ outside the box. Therefore, this scheme is somewhat heuristic and may be
too restrictive. Some other smooth penalty functions can be considered.

We briefly discuss the implementation of LPreg-AdaBoost. In
practical applications, the cardinality of the hypothesis function set can be infinite
and thus the gain matrix $Z$ does not exist in an explicit form. As a result, the linear
program cannot be solved directly. To overcome the problem, several
algorithms have been proposed. Two typical examples are the $\nu$-Arc [64] and
C-Barrier [62] algorithms. We will compare these methods with our proposed algorithms
in Section 3.4. From now on we use $|\mathcal{H}|$ to denote the cardinality of the hypothesis
function set and reserve $T$ for the iteration number of the AdaBoost algorithm.

3.2.2  AdaBoost$_{\mathrm{KL}}$

To control the skewness of the distribution, one strategy is to add a penalty
term $P(\mathbf{d})$, which measures the distance between the query distribution and the
distribution center, to the cost function of (3.1). This leads to the following optimization
problem:

$$\max_{\boldsymbol{\alpha} \in \Gamma^{|\mathcal{H}|}} \min_{\mathbf{d} \in \Gamma^N} \mathbf{d}^{\top} Z \boldsymbol{\alpha} + \beta P(\mathbf{d}) \tag{3.13}$$

where $\beta > 0$ is a predefined parameter controlling the tradeoff between the distribution
skewness and the training performance.

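Under the mild assumption that $P(\mathbf{d})$ is a convex function of $\mathbf{d}$, we have the
following lemma:

Lemma 3.2.1 If $P(\mathbf{d})$ is a convex function of $\mathbf{d}$, the following optimization schemes
are equivalent:

$$\begin{aligned} & \max_{\boldsymbol{\alpha} \in \Gamma^{|\mathcal{H}|}} \min_{\mathbf{d} \in \Gamma^N} \mathbf{d}^{\top} Z \boldsymbol{\alpha} + \beta P(\mathbf{d}) \\ = {} & \min_{(\gamma,\ \mathbf{d} \in \Gamma^N)} \gamma + \beta P(\mathbf{d}) \quad \text{subject to } \sum_{n=1}^{N} d_n z_{nj} \le \gamma, \ j = 1, \cdots, |\mathcal{H}| \end{aligned} \tag{3.14}$$

To prove Lemma 3.2.1 we need the following theorem.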

Theorem 3.2.1 (Generalized Minimax Theorem [24]) Let $A$ be a compact convex
subset of a vector space $W$, and $\Pi$ be a convex subset of a vector space $Z$. Let $f$
be a real-valued function defined on $A \times \Pi$ such that:

(1)  $f(x, y)$ is a convex and lower-semicontinuous function of $x$ for each $y$;

(2)  $f(x, y)$ is a concave function of $y$ for each $x$.

Then

$$\inf_{x \in A} \sup_{y \in \Pi} f(x, y) = \sup_{y \in \Pi} \inf_{x \in A} f(x, y) \tag{3.15}$$

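Proof of Lemma 3.2.1:

Since the function $\mathbf{d}^{\top} Z \boldsymbol{\alpha} + \beta P(\mathbf{d})$ is convex in $\mathbf{d}$ and concave in $\boldsymbol{\alpha}$, and the sets $\Gamma^N$
and $\Gamma^{|\mathcal{H}|}$ are convex and compact, the max and min in the first optimization problem
of (3.14) can be interchanged:

$$\begin{aligned} & \max_{\boldsymbol{\alpha} \in \Gamma^{|\mathcal{H}|}} \min_{\mathbf{d} \in \Gamma^N} \mathbf{d}^{\top} Z \boldsymbol{\alpha} + \beta P(\mathbf{d}) \\ = {} & \min_{\mathbf{d} \in \Gamma^N} \max_{\boldsymbol{\alpha} \in \Gamma^{|\mathcal{H}|}} \mathbf{d}^{\top} Z \boldsymbol{\alpha} + \beta P(\mathbf{d}) \\ = {} & \min_{\mathbf{d} \in \Gamma^N} \max_{1 \le j \le |\mathcal{H}|} \sum_{n=1}^{N} d_n z_{nj} + \beta P(\mathbf{d}) \\ = {} & \min_{(\gamma,\ \mathbf{d} \in \Gamma^N)} \gamma + \beta P(\mathbf{d}) \quad \text{subject to } \sum_{n=1}^{N} d_n z_{nj} \le \gamma, \ j = 1, \cdots, |\mathcal{H}| \end{aligned} \tag{3.16}$$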


Lemma  3.2.1  tells  us  that  if  we  use  a convex  penalty  function  of  d,  the  regularization 
scheme  can  be  pursued  directly  in  the  dual  domain. 

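One commonly used measure of the distance between discrete distributions is the
Kullback-Leibler distance, which with respect to the distribution center $\mathbf{d}_0$ is given as [17]:

$$\mathrm{KL}(\mathbf{d}, \mathbf{d}_0) = \sum_{n=1}^{N} d_n \ln \frac{d_n}{1/N} \tag{3.17}$$

$\mathrm{KL}(\mathbf{d}, \mathbf{d}_0)$ is convex in $\mathbf{d} > 0$ since its Hessian matrix is positive definite.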
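The penalty (3.17) is simple to evaluate. The following is a minimal sketch; the function name is illustrative, and the convention $0 \ln 0 = 0$ is an assumption added so that distributions with zero entries are handled.

```python
import math

def kl_to_uniform(d):
    """KL(d, d0) = sum_n d_n * ln(d_n / (1/N)), with 0 * ln(0) taken as 0."""
    n = len(d)
    return sum(dn * math.log(dn * n) for dn in d if dn > 0.0)

print(kl_to_uniform([0.25, 0.25, 0.25, 0.25]))   # distribution center: no penalty
print(kl_to_uniform([0.7, 0.1, 0.1, 0.1]))       # skewed distribution: positive penalty
print(kl_to_uniform([1.0, 0.0, 0.0, 0.0]))       # all mass on one sample: ln(N)
```

The penalty is zero at the distribution center and grows as the mass concentrates on fewer samples, reaching $\ln N$ when a single sample carries all the weight.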

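Following Lemma 3.2.1, we write the regularized scheme in the dual domain
as:

$$\begin{array}{ll} \min_{(\gamma, \mathbf{d})} & \gamma + \beta \sum_{n=1}^{N} d_n \ln \frac{d_n}{1/N} \\ \text{subject to} & \sum_{n=1}^{N} d_n z_{nj} \le \gamma, \quad j = 1, \cdots, |\mathcal{H}| \\ & \mathbf{d} \in \Gamma^N \end{array} \tag{3.18}$$

This scheme is also suggested in [65] from the viewpoint of the Total Corrective
Algorithm [40]. The problem (3.18) can be reformulated as:

$$\begin{array}{ll} \min_{(\gamma, \mathbf{d})} & \gamma \\ \text{subject to} & s_j(\mathbf{d}) = \sum_{n=1}^{N} d_n z_{nj} + \beta \sum_{n=1}^{N} d_n \ln \frac{d_n}{1/N} \le \gamma, \quad j = 1, \cdots, |\mathcal{H}| \\ & \mathbf{d} \in \Gamma^N \end{array} \tag{3.19}$$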

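Define $s(\mathbf{d}) = \max_{1 \le j \le |\mathcal{H}|} s_j(\mathbf{d})$. Note that $s(\mathbf{d})$ is also a convex function.
Suppose now we have a set of query distributions $S = \{\mathbf{d}^{(t)}\}_{t=1}^{T}$. For each query
distribution $\mathbf{d}^{(t)}$, we can find a supporting hyperplane to the epigraph of $s(\mathbf{d})$. The
equation of the supporting hyperplane is given by:

$$\gamma = s(\mathbf{d}^{(t)}) + \boldsymbol{\zeta}_t^{\top} (\mathbf{d} - \mathbf{d}^{(t)}) \tag{3.20}$$

where $\boldsymbol{\zeta}_t$ is an element of the subdifferential of $s$ at $\mathbf{d}^{(t)}$. Due to the
convexity of $s(\mathbf{d})$, a supporting hyperplane gives an underestimate of $s$. More precisely,
Equation (3.20) can be written as:

$$\begin{aligned} \gamma &= \max_{1 \le j \le |\mathcal{H}|} s_j(\mathbf{d}^{(t)}) + \boldsymbol{\zeta}_t^{\top} (\mathbf{d} - \mathbf{d}^{(t)}) \\ &= \sum_{n=1}^{N} d_n^{(t)} \Bigl(z_{nt} + \beta \ln \frac{d_n^{(t)}}{1/N}\Bigr) + \Bigl(\mathbf{z}_{\cdot t} + \beta \Bigl[\ln \frac{\mathbf{d}^{(t)}}{1/N} + \mathbf{1}\Bigr]\Bigr)^{\top} (\mathbf{d} - \mathbf{d}^{(t)}) \\ &= \Bigl(\mathbf{z}_{\cdot t} + \beta \ln \frac{\mathbf{d}^{(t)}}{1/N}\Bigr)^{\top} \mathbf{d} \end{aligned} \tag{3.21}$$

where the logarithm is taken elementwise,

$$\mathbf{z}_{\cdot t} = [y_1 h_t(\mathbf{x}_1), \cdots, y_N h_t(\mathbf{x}_N)]^{\top}, \tag{3.22}$$

and

$$h_t = \arg\max_{h \in \mathcal{H}} \sum_{n=1}^{N} d_n^{(t)} h(\mathbf{x}_n) y_n \tag{3.23}$$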


Define: 


Z = Z + /3 


In 


d(i) 


,ln 


d(D 


(3.24) 


1/N’  ’"'l/iV. 

where  each  column  z.t  of  Z is  found  based  on  (3.22)  and  (3.23).  Note  that  Z can 


be  interpreted  as  a new  gain  matrix  and  it  means  that  adding  a penalty  function  to 
(3.1)  ends  up  with  a modification  of  the  gain  matrix  which  encodes  the  distribution 
information  into  the  hypothesis  decisions.  Now  the  optimization  problem  (3.18)  can 


be  approximated  as: 


min(.y,d)  7 

subject  to  z^d  < 7,  t = 1,  • • • , T (3.25) 

d e 

This is a linear programming problem that is easier to deal with than the original problem. However, it is only a linear approximation, which gets better as more constraints are added. The query distributions can be obtained through the column generation technique, and the finite convergence of the optimization problem (3.18) can be guaranteed. However, column generation usually converges slowly due to the degeneracy of (3.25) and produces many unnecessary query distributions, or columns. In [19], column generation was used for implementing LPreg-AdaBoost. The problem of slow convergence was alleviated by setting a lower bound for each
query  distribution,  i.e.,  Ql  < < Cl  with  Ci  <C  C,  and  consequently  the  possible 

query distributions are constrained into an even smaller area. In our case of $\mathbf{d} \in \Gamma^N$, other more sophisticated stabilized column generation techniques [52] may be needed, and this topic is left for future research. In this chapter, we only focus on forming a regularized AdaBoost classifier to approximately solve (3.25).

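The dual form of (3.25) is:

$$
\begin{aligned}
\max_{(\rho, \boldsymbol{\alpha})} \quad & \rho \\
\text{subject to} \quad & \sum_{t=1}^{T} \alpha_t z_{nt} + \beta \sum_{t=1}^{T} \alpha_t \ln\frac{d_n^{(t)}}{1/N} = \rho_s(\mathbf{x}_n) \ge \rho, \quad n = 1, \dots, N, \\
& \boldsymbol{\alpha} \in \Gamma^T
\end{aligned}
\tag{3.26}
$$

The soft margin of a pattern $\mathbf{x}_n$ can be defined as:

$$
\rho_s(\mathbf{x}_n) = \sum_{t=1}^{T} \alpha_t z_{nt} + \beta \sum_{t=1}^{T} \alpha_t \ln\frac{d_n^{(t)}}{1/N} \tag{3.27}
$$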
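Equation (3.27) can be evaluated for a single pattern as follows. This is a minimal sketch with a hypothetical function name; it assumes the coefficients are already normalized and that the weights the pattern received in each round are available.

```python
import math

def soft_margin_kl(alpha, z_n, d_n, n_samples, beta):
    """Soft margin of one pattern, Eq. (3.27).

    alpha: normalized coefficients alpha_t
    z_n:   y_n * h_t(x_n) for each round t
    d_n:   weight d_n^(t) of this pattern in each round t
    """
    margin = sum(a * z for a, z in zip(alpha, z_n))
    # Second term of (3.27): log-ratio of the weight to the uniform weight 1/N.
    extra = beta * sum(a * math.log(d * n_samples) for a, d in zip(alpha, d_n))
    return margin + extra

alpha = [0.5, 0.5]
# A pattern whose weight rose above 1/N (N = 4) after being misclassified once.
print(soft_margin_kl(alpha, [-1, 1], [0.25, 0.5], 4, beta=0.1))
# A pattern whose weight fell below 1/N after being classified correctly.
print(soft_margin_kl(alpha, [1, 1], [0.25, 0.125], 4, beta=0.1))
```

The second term is positive whenever the pattern carries more than the uniform weight and negative whenever it carries less.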

The term $\beta \sum_{t=1}^{T} \alpha_t \ln\frac{d_n^{(t)}}{1/N}$ can be understood as a “mistrust” in examples. The rationale is: a pattern which is often visited by the learning algorithm (i.e., hard to classify correctly) will have a high average distribution and should have less influence on the outcome of the final classifier. Note also that the mistrust is calculated with respect to the center distribution. For example, if the query distribution $d_n^{(t)} < 1/N$, $t = 1, \dots, T$, the “mistrust” takes a negative value. As a result, the soft margin penalizes some hard examples and at the same time awards some easy examples. In [63], [70], it was experimentally observed that AdaBoost increases the margin of the most hard-to-learn examples at the cost of reducing the margins of the rest of the data. Therefore, defining a soft margin in the way of (3.27) can be understood as softening the AdaBoost process with the strength being controlled by $\beta$.

Now, with a soft margin defined in (3.27) and following the same strategy as
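that used in deriving AdaBoost_Reg [62], where AdaBoost is used as a general solver of a minimax problem, a new AdaBoost-like algorithm, which we refer to as AdaBoost_KL, can be formulated. Specifically, we define a new cost function:

$$
\begin{aligned}
G_{KL}^{(t)} &= \sum_{n=1}^{N} \exp\left\{ -\rho_s(\mathbf{x}_n) \sum_{j=1}^{t} \tilde{\alpha}_j \right\} \\
&= \sum_{n=1}^{N} \exp\left\{ -\left[ \sum_{j=1}^{t} \tilde{\alpha}_j z_{nj} + \beta \sum_{j=1}^{t} \tilde{\alpha}_j \ln\frac{d_n^{(j)}}{1/N} \right] \right\}
\end{aligned}
\tag{3.28}
$$

where $\tilde{\boldsymbol{\alpha}}$ is the unnormalized version of $\boldsymbol{\alpha}$. The combination coefficient $\tilde{\alpha}_t$ for the hypothesis $h_t$ is computed as:

$$
\tilde{\alpha}_t = \arg\min_{\tilde{\alpha}_t \ge 0} G_{KL}^{(t)}
= \arg\min_{\tilde{\alpha}_t \ge 0} \sum_{n=1}^{N} \exp\left\{ -\left[ \sum_{j=1}^{t} \tilde{\alpha}_j z_{nj} + \beta \sum_{j=1}^{t} \tilde{\alpha}_j \ln\frac{d_n^{(j)}}{1/N} \right] \right\}
\tag{3.29}
$$

It is difficult to compute $\tilde{\alpha}_t$ analytically. However, we can get $\tilde{\alpha}_t$ efficiently by a line search since $\partial^2 G_{KL}^{(t)} / \partial \tilde{\alpha}_t^2 > 0$. The updated distribution $\mathbf{d}^{(t+1)}$ is computed as the derivative of $G_{KL}^{(t)}$ with respect to $\rho_s(\mathbf{x}_n)$:

$$
d_n^{(t+1)} = \frac{\partial G_{KL}^{(t)} / \partial \rho_s(\mathbf{x}_n)}{\sum_{j=1}^{N} \partial G_{KL}^{(t)} / \partial \rho_s(\mathbf{x}_j)}
= \frac{d_n^{(t)}}{C_t} \exp\left\{ -\tilde{\alpha}_t h_t(\mathbf{x}_n) y_n - \beta \tilde{\alpha}_t \ln\frac{d_n^{(t)}}{1/N} \right\}
\tag{3.30}
$$

where $C_t$ is the normalization constant such that $\sum_{n=1}^{N} d_n^{(t+1)} = 1$.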
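AdaBoost_KL is summarized as follows:

AdaBoost_KL

Initialization: $D = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, maximum iteration number $T$, $d^{(1)}(\mathbf{x}_n) = 1/N$, parameter $\beta$.

for t = 1 : T

1. Train weak learner with respect to distribution $\mathbf{d}^{(t)}$ and get hypothesis $h_t(\mathbf{x}) : \mathbf{x} \to \{\pm 1\}$.

2. Calculate the coefficient $\tilde{\alpha}_t$ of $h_t$ as:

$$
\tilde{\alpha}_t = \arg\min_{\tilde{\alpha}_t \ge 0} \sum_{n=1}^{N} \exp\left\{ -\left[ \sum_{j=1}^{t} \tilde{\alpha}_j z_{nj} + \beta \sum_{j=1}^{t} \tilde{\alpha}_j \ln\frac{d_n^{(j)}}{1/N} \right] \right\}
\tag{3.31}
$$

4. Update weights:

$$
d^{(t+1)}(\mathbf{x}_n) = \frac{d^{(t)}(\mathbf{x}_n)}{C_t} \exp\left\{ -\tilde{\alpha}_t h_t(\mathbf{x}_n) y_n - \beta \tilde{\alpha}_t \ln\frac{d^{(t)}(\mathbf{x}_n)}{1/N} \right\}
\tag{3.32}
$$

where $C_t$ is the normalization constant such that $\sum_{n=1}^{N} d^{(t+1)}(\mathbf{x}_n) = 1$.

end

Output: $F(\mathbf{x}) = \sum_{t=1}^{T} \tilde{\alpha}_t h_t(\mathbf{x})$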


It is clear that if $\beta = 0$, we recover the original AdaBoost algorithm, and if $\beta \to \infty$, we can prove that only the first classifier $h_1$ is kept, i.e., $\tilde{\alpha}_t = 0$ for $t \ge 2$, which corresponds to the single classifier design (see the proof in Lemma 3.2.2). This means that by varying the parameter $\beta$ we are able to control the boosting strength of the learning process to alleviate the overfitting problem.
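The AdaBoost_KL loop and the two limiting cases of $\beta$ can be illustrated in code. The following is a minimal sketch, not the implementation used in the experiments: it assumes a decision stump as the weak learner, a bisection on the derivative as the line search, an arbitrary cap `alpha_max` on the coefficient, and a clipped exponent as a numerical safeguard. All function names are hypothetical. It uses the fact that, by the recursion (3.32), the cost (3.28) is proportional to $\sum_n d_n^{(t)} \exp(-\tilde{\alpha}_t u_n)$ with $u_n = y_n h_t(\mathbf{x}_n) + \beta \ln(N d_n^{(t)})$.

```python
import math

def safe_exp(v):
    # Clip the exponent so math.exp never raises OverflowError (safeguard, an assumption).
    return math.exp(min(v, 500.0))

def train_stump(X, y, d):
    """Decision stump maximizing the edge sum_n d_n * y_n * h(x_n), cf. Eq. (3.23)."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (1, -1):
                edge = sum(dn * yn * (sign if x[f] > thr else -sign)
                           for x, yn, dn in zip(X, y, d))
                if best is None or edge > best[0]:
                    best = (edge, f, thr, sign)
    _, f, thr, sign = best
    return lambda x: sign if x[f] > thr else -sign

def line_search(d, u, alpha_max=10.0, iters=60):
    """Minimize the convex function sum_n d_n * exp(-alpha * u_n) over [0, alpha_max]."""
    def slope(a):
        return -sum(dn * un * safe_exp(-a * un) for dn, un in zip(d, u))
    if slope(0.0) >= 0.0:
        return 0.0
    if slope(alpha_max) <= 0.0:
        return alpha_max
    lo, hi = 0.0, alpha_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def adaboost_kl(X, y, T, beta):
    n = len(X)
    d = [1.0 / n] * n
    ensemble = []
    for _ in range(T):
        h = train_stump(X, y, d)
        # u_n = y_n h_t(x_n) + beta * ln(d_n / (1/N)); the floor avoids log(0).
        u = [yn * h(x) + beta * math.log(max(dn * n, 1e-300))
             for x, yn, dn in zip(X, y, d)]
        alpha = line_search(d, u)              # coefficient, cf. Eq. (3.31)
        ensemble.append((alpha, h))
        w = [dn * safe_exp(-alpha * un) for dn, un in zip(d, u)]
        total = sum(w)
        d = [wn / total for wn in w]           # weight update, cf. Eq. (3.32)
    return ensemble, d

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [-1, -1, 1, -1, 1, 1]
plain, _ = adaboost_kl(X, y, T=5, beta=0.0)
strong, _ = adaboost_kl(X, y, T=5, beta=1000.0)
print([round(a, 4) for a, _ in plain])
print([round(a, 6) for a, _ in strong])
```

With `beta=0.0` the line search reproduces the usual AdaBoost coefficient $\frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ for the first stump, and with a very large `beta` the coefficients after the first round shrink toward zero, in line with the two limiting cases described above.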

Lemma 3.2.2 Suppose in each iteration the learning algorithm can find a hypothesis such that Equation (3.23) holds. If $\beta \to \infty$, only the first hypothesis $h_1$ will be kept in AdaBoost_KL, i.e., $\tilde{\alpha}_t = 0$ for $t \ge 2$.

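Proof:

Suppose now we have found $h_1$ and the corresponding combination coefficient $\alpha_1$. We also found $h_2$ and need to determine $\alpha_2$ by line search. The intermediate cost function is given by:

$$
\begin{aligned}
G_{KL}^{(2)} &= \sum_{n=1}^{N} \exp\left(-F_1(\mathbf{x}_n) y_n\right) \exp\left\{ -\alpha_2 h_2(\mathbf{x}_n) y_n - \alpha_2 \beta \ln(d_n^{(2)} N) \right\} \\
&= \sum_{j=1}^{N} \exp\left(-F_1(\mathbf{x}_j) y_j\right) \sum_{n=1}^{N} d_n^{(2)} \exp\left\{ -\alpha_2 h_2(\mathbf{x}_n) y_n - \alpha_2 \beta \ln(d_n^{(2)} N) \right\}
\end{aligned}
\tag{3.33}
$$

$\alpha_2$ can be computed by taking the derivative of $G_{KL}^{(2)}$ with respect to $\alpha_2$ and setting it to zero. For simplicity, we drop the constant terms.

$$
\begin{aligned}
& \sum_{n=1}^{N} d_n^{(2)} \exp\left\{ -\alpha_2 h_2(\mathbf{x}_n) y_n - \alpha_2 \beta \ln(d_n^{(2)} N) \right\} \left\{ -h_2(\mathbf{x}_n) y_n - \beta \ln(d_n^{(2)} N) \right\} \\
\propto{} & \sum_{n=1}^{N} \left(d_n^{(2)}\right)^{1 - \alpha_2 \beta} \exp\left(-\alpha_2 h_2(\mathbf{x}_n) y_n\right) \left\{ -h_2(\mathbf{x}_n) y_n - \beta \ln(d_n^{(2)} N) \right\} \\
={} & 0
\end{aligned}
\tag{3.34}
$$

By setting $\alpha_2 = 1/\beta$ and letting $\beta \to \infty$, we have:

$$
\sum_{n=1}^{N} \exp\left( -\frac{h_2(\mathbf{x}_n) y_n}{\beta} \right) \left\{ -h_2(\mathbf{x}_n) y_n - \beta \ln(d_n^{(2)} N) \right\} > 0
\tag{3.35}
$$

The last inequality is due to the fact that $\sum_{n=1}^{N} \ln(d_n^{(2)} N) < 0$ for $\mathbf{d}^{(2)} \ne \mathbf{d}_0$. By setting $\alpha_2 = 1/\beta^2$ and letting $\beta \to \infty$, we have:

$$
\sum_{n=1}^{N} \left(d_n^{(2)}\right)^{1 - 1/\beta} \exp\left( -\frac{h_2(\mathbf{x}_n) y_n}{\beta^2} \right) \left\{ -h_2(\mathbf{x}_n) y_n - \beta \ln(d_n^{(2)} N) \right\} < 0
\tag{3.36}
$$

The last inequality is due to the fact that $\sum_{n=1}^{N} d_n^{(2)} \ln(d_n^{(2)} N) > 0$ for $\mathbf{d}^{(2)} \ne \mathbf{d}_0$. Since the cost function is convex in $\alpha_2$, its derivative is increasing in $\alpha_2$, so the root of (3.34) lies between these two values. Therefore, when $\beta \to \infty$, $\alpha_2 \in (1/\beta^2, 1/\beta) \to 0$.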
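The sign change used in the proof can be checked numerically. The sketch below assumes a small hand-picked distribution $\mathbf{d}^{(2)}$ and margins $y_n h_2(\mathbf{x}_n)$ (both hypothetical, chosen only for illustration) and evaluates the derivative in (3.34), up to a positive constant, at $\alpha_2 = 1/\beta$ and $\alpha_2 = 1/\beta^2$.

```python
import math

def cost_derivative(alpha2, beta, d2, margins):
    """Derivative of the intermediate cost w.r.t. alpha_2, up to a positive factor."""
    n = len(d2)
    total = 0.0
    for dn, m in zip(d2, margins):
        u = m + beta * math.log(dn * n)   # y_n h_2(x_n) + beta * ln(d_n^(2) N)
        total += dn * math.exp(-alpha2 * u) * (-u)
    return total

d2 = [0.5, 0.3, 0.2]      # a non-uniform query distribution (illustrative)
margins = [1, -1, 1]      # y_n * h_2(x_n) (illustrative)
beta = 1000.0

print(cost_derivative(1.0 / beta, beta, d2, margins) > 0)       # derivative positive
print(cost_derivative(1.0 / beta ** 2, beta, d2, margins) < 0)  # derivative negative
```

Because the derivative is negative at $1/\beta^2$ and positive at $1/\beta$, the minimizer of the convex cost is bracketed by these two values, which both vanish as $\beta$ grows.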


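3.2.3 AdaBoost_norm2

We can also consider using an $l_p$ norm function as the penalty function. It is easy to show that $\|\mathbf{d} - \mathbf{d}_0\|_p$ is a convex function of $\mathbf{d}$ and, following Lemma 3.2.1, we get the following regularized scheme in the dual domain:

$$
\begin{aligned}
\min_{(\gamma, \mathbf{d})} \quad & \gamma + \beta \|\mathbf{d} - \mathbf{d}_0\|_p \\
\text{subject to} \quad & \sum_{n=1}^{N} d_n z_{nj} \le \gamma, \quad j = 1, \dots, |\mathcal{H}|, \\
& \mathbf{d} \in \Gamma^N
\end{aligned}
\tag{3.37}
$$

In this chapter, we only focus on the case of $l_2$. The above optimization problem can be reformulated as:

$$
\begin{aligned}
\min_{(\gamma, \mathbf{d})} \quad & \gamma \\
\text{subject to} \quad & s_j(\mathbf{d}) = \sum_{n=1}^{N} d_n z_{nj} + \beta \|\mathbf{d} - \mathbf{d}_0\|_2 \le \gamma, \quad j = 1, \dots, |\mathcal{H}|, \\
& \mathbf{d} \in \Gamma^N
\end{aligned}
\tag{3.38}
$$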

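Define $s(\mathbf{d}) = \max_{1 \le j \le |\mathcal{H}|} s_j(\mathbf{d})$. $s(\mathbf{d})$ is also a convex function. Suppose now we have a set of query distributions $S = \{\mathbf{d}^{(t)}\}_{t=1}^{T}$. For each query distribution $\mathbf{d}^{(t)}$, we can find one supporting hyperplane whose equation is given by:

$$
\begin{aligned}
\gamma &= s(\mathbf{d}^{(t)}) + \zeta_t^T (\mathbf{d} - \mathbf{d}^{(t)}) \\
&= \max_{1 \le j \le |\mathcal{H}|} s_j(\mathbf{d}^{(t)}) + \zeta_t^T (\mathbf{d} - \mathbf{d}^{(t)}) \\
&= \mathbf{z}_{\cdot t}^T \mathbf{d}^{(t)} + \beta \|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2
+ \left( \mathbf{z}_{\cdot t} + \beta \frac{\mathbf{d}^{(t)} - \mathbf{d}_0}{\|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2} \right)^T (\mathbf{d} - \mathbf{d}^{(t)})
\end{aligned}
\tag{3.39}
$$

where

$$
\mathbf{z}_{\cdot t} = [y_1 h_t(\mathbf{x}_1), \; \dots, \; y_N h_t(\mathbf{x}_N)]^T, \tag{3.40}
$$

and

$$
h_t = \arg\max_{h \in \mathcal{H}} \sum_{n=1}^{N} d_n^{(t)} h(\mathbf{x}_n) y_n. \tag{3.41}
$$

Define:

$$
\tilde{\mathbf{Z}} = \mathbf{Z} + \beta \left[ \frac{\mathbf{d}^{(1)} - \mathbf{d}_0}{\|\mathbf{d}^{(1)} - \mathbf{d}_0\|_2}, \; \dots, \; \frac{\mathbf{d}^{(T)} - \mathbf{d}_0}{\|\mathbf{d}^{(T)} - \mathbf{d}_0\|_2} \right]
\tag{3.42}
$$

where each column $\mathbf{z}_{\cdot t}$ of $\mathbf{Z}$ is found based on (3.40) and (3.41). Now the optimization problem (3.37) can be linearly approximated as:

$$
\begin{aligned}
\min_{(\gamma, \mathbf{d})} \quad & \gamma \\
\text{subject to} \quad & \tilde{\mathbf{z}}_{\cdot t}^T \mathbf{d} \le \gamma, \quad t = 1, \dots, T, \\
& \mathbf{d} \in \Gamma^N
\end{aligned}
\tag{3.43}
$$

This is a linear programming problem that is easier to deal with than the original problem. Note, however, that it is only an approximation, which gets better as more constraints are added.

The dual form of (3.43) is:

$$
\begin{aligned}
\max_{(\rho, \boldsymbol{\alpha})} \quad & \rho \\
\text{subject to} \quad & \sum_{t=1}^{T} \alpha_t z_{nt} + \beta \sum_{t=1}^{T} \alpha_t \frac{d_n^{(t)} - 1/N}{\|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2} = \rho_s(\mathbf{x}_n) \ge \rho, \quad n = 1, \dots, N, \\
& \boldsymbol{\alpha} \in \Gamma^T
\end{aligned}
\tag{3.44}
$$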

The soft margin of a pattern $\mathbf{x}_n$ is defined as:

$$
\rho_s(\mathbf{x}_n) = \sum_{t=1}^{T} \alpha_t z_{nt} + \beta \sum_{t=1}^{T} \alpha_t \frac{d_n^{(t)} - 1/N}{\|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2}
\tag{3.45}
$$

Again the term $\beta \sum_{t=1}^{T} \alpha_t \frac{d_n^{(t)} - 1/N}{\|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2}$ can be understood as a “mistrust” in examples with respect to the center distribution. The parameter $\beta$ controls the tradeoff between margin and “mistrust”. It is interesting to note that the soft margin in (3.45) is very similar to that in AdaBoost_Reg [62], which is defined as:

$$
\rho_{Reg}(\mathbf{x}_n) = \sum_{t=1}^{T} \alpha_t z_{nt} + \beta \sum_{t=1}^{T} \alpha_t d_n^{(t)}
\tag{3.46}
$$

The main difference is that our soft margin is calculated with respect to the center distribution, and the term $\|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2$ can be roughly understood as follows: the closer the query distribution is to the center distribution, the more trust the outcome of the hypothesis deserves. Now, following the same strategy used in deriving AdaBoost_KL, the optimization problem can be easily reformulated into an AdaBoost-like algorithm, which we call AdaBoost_norm2. We can define a new cost function:

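$$
G_{norm2}^{(t)} = \sum_{n=1}^{N} \exp\left\{ -\left[ \sum_{j=1}^{t} \alpha_j z_{nj} + \beta \sum_{j=1}^{t} \alpha_j \frac{d_n^{(j)} - 1/N}{\|\mathbf{d}^{(j)} - \mathbf{d}_0\|_2} \right] \right\}
\tag{3.47}
$$

The combination coefficient $\alpha_t$ is computed as:

$$
\alpha_t = \arg\min_{\alpha_t \ge 0} G_{norm2}^{(t)}
= \arg\min_{\alpha_t \ge 0} \sum_{n=1}^{N} \exp\left\{ -\left[ \sum_{j=1}^{t} \alpha_j z_{nj} + \beta \sum_{j=1}^{t} \alpha_j \frac{d_n^{(j)} - 1/N}{\|\mathbf{d}^{(j)} - \mathbf{d}_0\|_2} \right] \right\}
\tag{3.48}
$$

Again we can get $\alpha_t$ efficiently by a line search since $\partial^2 G_{norm2}^{(t)} / \partial \alpha_t^2 > 0$. The updated distribution $\mathbf{d}^{(t+1)}$ is computed as the derivative of $G_{norm2}^{(t)}$ with respect to $\rho_s(\mathbf{x}_n)$:

$$
d_n^{(t+1)} = \frac{\partial G_{norm2}^{(t)} / \partial \rho_s(\mathbf{x}_n)}{\sum_{j=1}^{N} \partial G_{norm2}^{(t)} / \partial \rho_s(\mathbf{x}_j)}
= \frac{d_n^{(t)}}{C_t} \exp\left\{ -\alpha_t h_t(\mathbf{x}_n) y_n - \beta \alpha_t \frac{d_n^{(t)} - 1/N}{\|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2} \right\}
\tag{3.49}
$$

where $C_t$ is the normalization constant such that $\sum_{n=1}^{N} d_n^{(t+1)} = 1$.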
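The reweighting step (3.49) can be sketched as follows. This is an illustrative sketch with hypothetical function names. One detail is an assumption of the sketch rather than something stated in the text: when the query distribution coincides with the center distribution (as in the first round), the ratio $(d_n - 1/N)/\|\mathbf{d} - \mathbf{d}_0\|_2$ is 0/0 and is taken to be zero here.

```python
import math

def norm2_term(d):
    """Per-sample term (d_n - 1/N) / ||d - d0||_2, taken as 0 when d equals d0 (assumption)."""
    n = len(d)
    diff = [dn - 1.0 / n for dn in d]
    norm = math.sqrt(sum(v * v for v in diff))
    if norm < 1e-12:
        return [0.0] * n
    return [v / norm for v in diff]

def update_distribution_norm2(d, margins, alpha, beta):
    """One reweighting step following Eq. (3.49); margins[n] = y_n * h_t(x_n)."""
    m = norm2_term(d)
    w = [dn * math.exp(-alpha * (z + beta * mn)) for dn, z, mn in zip(d, margins, m)]
    c = sum(w)                       # normalization constant C_t
    return [wn / c for wn in w]

d1 = [0.25, 0.25, 0.25, 0.25]
margins = [1, 1, -1, 1]              # the third pattern is misclassified
d2 = update_distribution_norm2(d1, margins, alpha=0.5, beta=0.3)
d3_regularized = update_distribution_norm2(d2, margins, alpha=0.5, beta=0.3)
d3_plain = update_distribution_norm2(d2, margins, alpha=0.5, beta=0.0)
print(d3_regularized[2], d3_plain[2])
```

When the same pattern is misclassified repeatedly, its weight grows more slowly with `beta > 0` than with `beta = 0`, which is the intended damping of the distribution skewness.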

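AdaBoost_norm2 is summarized as follows:

AdaBoost_norm2

Initialization: $D = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, maximum iteration number $T$, $d^{(1)}(\mathbf{x}_n) = 1/N$, parameter $\beta$.

for t = 1 : T

1. Train weak learner with respect to distribution $\mathbf{d}^{(t)}$ and get hypothesis $h_t(\mathbf{x}) : \mathbf{x} \to \{\pm 1\}$.

2. Calculate the coefficient $\alpha_t$ of $h_t$ as:

$$
\alpha_t = \arg\min_{\alpha_t \ge 0} \sum_{n=1}^{N} \exp\left\{ -\left[ \sum_{j=1}^{t} \alpha_j z_{nj} + \beta \sum_{j=1}^{t} \alpha_j \frac{d_n^{(j)} - 1/N}{\|\mathbf{d}^{(j)} - \mathbf{d}_0\|_2} \right] \right\}
\tag{3.50}
$$

4. Update weights:

$$
d^{(t+1)}(\mathbf{x}_n) = \frac{d^{(t)}(\mathbf{x}_n)}{C_t} \exp\left\{ -\alpha_t h_t(\mathbf{x}_n) y_n - \beta \alpha_t \frac{d^{(t)}(\mathbf{x}_n) - 1/N}{\|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2} \right\}
\tag{3.51}
$$

where $C_t$ is the normalization constant such that $\sum_{n=1}^{N} d^{(t+1)}(\mathbf{x}_n) = 1$.

end

Output: $F(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})$

3.2.4 Framework of Regularized Boosting Algorithms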

In this subsection, we present a framework for regularized boosting algorithms with a constraint on the distributions. It can in general be formulated as the following optimization problem in the dual domain:

$$
\begin{aligned}
\min_{(\gamma, \mathbf{d})} \quad & \gamma + \beta(P) P(\mathbf{d}) \\
\text{subject to} \quad & \mathbf{z}_{\cdot j}^T \mathbf{d} \le \gamma, \quad j = 1, \dots, |\mathcal{H}|, \\
& \mathbf{d} \in \Gamma^N
\end{aligned}
\tag{3.52}
$$

where $P(\mathbf{d})$ is a penalty function which usually measures the skewness between $\mathbf{d}$ and the center distribution $\mathbf{d}_0$, and $\beta(P)$ is the parameter, which can be a function of $P$. If $\beta$ is a predefined parameter and $P(\mathbf{d})$ is the Kullback-Leibler distance or the Euclidean distance with respect to $\mathbf{d}_0$, we retrieve the AdaBoost_KL and AdaBoost_norm2 algorithms, respectively. If we choose $P(\mathbf{d}) = \|\mathbf{d}\|_\infty$ and $\beta(P) = 0$ if $P \le C$ and $\infty$ if $P > C$, we get LPreg-AdaBoost. Although they differ in their implementation schemes, both the ν-Arc and C-Barrier algorithms fall into this category.
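The three penalty choices named above can be placed side by side in code. The sketch below uses hypothetical function names and assumes a uniform center distribution; it evaluates the objective of (3.52) for a given distribution, taking $\gamma$ as the smallest value that satisfies the constraints, i.e. $\gamma = \max_j \mathbf{z}_{\cdot j}^T \mathbf{d}$.

```python
import math

def kl_penalty(d):
    """Kullback-Leibler distance to the uniform center distribution."""
    n = len(d)
    return sum(dn * math.log(dn * n) for dn in d if dn > 0.0)

def norm2_penalty(d):
    """Euclidean distance to the uniform center distribution."""
    n = len(d)
    return math.sqrt(sum((dn - 1.0 / n) ** 2 for dn in d))

def hard_limit_term(d, c):
    """beta(P) * P(d) with P(d) = max_n d_n: zero if P <= C, infinite otherwise."""
    return 0.0 if max(d) <= c else math.inf

def dual_objective(Z, d, penalty_term):
    """gamma + penalty, with gamma = max_j sum_n d_n * z_nj (smallest feasible gamma)."""
    n_hyp = len(Z[0])
    gamma = max(sum(d[n] * Z[n][j] for n in range(len(d))) for j in range(n_hyp))
    return gamma + penalty_term(d)

Z = [[1, -1], [1, 1], [-1, 1]]     # illustrative gain matrix: 3 samples, 2 hypotheses
d = [0.6, 0.2, 0.2]
beta = 0.5
print(dual_objective(Z, d, lambda v: beta * kl_penalty(v)))      # KL penalty
print(dual_objective(Z, d, lambda v: beta * norm2_penalty(v)))   # l2 penalty
print(dual_objective(Z, d, lambda v: hard_limit_term(v, 0.4)))   # hard-limited penalty
```

The two smooth penalties raise the objective gradually as the distribution becomes skewed, whereas the hard-limited term leaves it unchanged up to the bound and rules out everything beyond it.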

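3.3 Further Analysis on AdaBoost_KL and AdaBoost_norm2

As for the AdaBoost algorithm itself, a strict proof that these two proposed algorithms achieve the optimum solutions of the associated supporting minimax problems is still missing. Therefore, a further analysis based directly on the algorithms is needed. As noted in Equations (3.24) and (3.42), adding a penalty term to a minimax problem amounts to a modification of the gain matrix. The two algorithms can be interpreted as performing a gradient descent procedure in a new hypothesis space $\tilde{\mathcal{H}}$ (which we do not need to construct explicitly in the implementation), whose elements $\tilde{h}(\mathbf{x})$ are formed as follows: for each possible query distribution $\mathbf{d} \in \Gamma^N$,

$$
\tilde{h}(\mathbf{x}_n; \mathbf{d}) =
\begin{cases}
h^*(\mathbf{x}_n) + \beta y_n \ln\dfrac{d_n}{1/N}, & \text{AdaBoost}_{KL} \\[2ex]
h^*(\mathbf{x}_n) + \beta y_n \dfrac{d_n - 1/N}{\|\mathbf{d} - \mathbf{d}_0\|_2}, & \text{AdaBoost}_{norm2}
\end{cases}
\tag{3.53}
$$

where $h^* = \arg\max_{h \in \mathcal{H}} \sum_{n=1}^{N} d_n h(\mathbf{x}_n) y_n$. Then all of the analyses we presented in Section 2.2 (also in [47]) can be applied directly to these two algorithms. However, since we do not explicitly search for the direction $\tilde{h}(\mathbf{x})$ in the $\tilde{\mathcal{H}}$ space in AdaBoost_KL and AdaBoost_norm2, we need to prove that the direction obtained in each iteration is indeed the one we should choose to maximally decrease the cost functions (3.28) and (3.47). The proof follows.

We first consider AdaBoost_KL. Following the analysis in Section 2.2, in iteration $t$ we should search for an $\tilde{h}(\mathbf{x})$ such that:

$$
\langle -\nabla G_{KL}, \tilde{h} \rangle \propto \sum_{n=1}^{N} d_n^{(t)} y_n \tilde{h}(\mathbf{x}_n; \mathbf{d})
= \sum_{n=1}^{N} d_n^{(t)} h(\mathbf{x}_n) y_n + \beta \sum_{n=1}^{N} d_n^{(t)} \ln\frac{d_n}{1/N}
\tag{3.54}
$$

is maximized. It is not difficult to prove that (3.54) attains the maximum when $\tilde{h}(\mathbf{x}) = \tilde{h}(\mathbf{x}; \mathbf{d}^{(t)})$ by exploiting the properties of the Kullback-Leibler distance.

The same arguments can be applied to the case of AdaBoost_norm2. Specifically, we need to show that $\sum_{n=1}^{N} d_n^{(t)} \frac{d_n - 1/N}{\|\mathbf{d} - \mathbf{d}_0\|_2}$ is maximized when $\mathbf{d} = \mathbf{d}^{(t)}$:

$$
\frac{(\mathbf{d} - \mathbf{d}_0)^T \mathbf{d}^{(t)}}{\|\mathbf{d} - \mathbf{d}_0\|_2}
= \frac{(\mathbf{d} - \mathbf{d}_0)^T (\mathbf{d}^{(t)} - \mathbf{d}_0)}{\|\mathbf{d} - \mathbf{d}_0\|_2}
\le \frac{\left| (\mathbf{d} - \mathbf{d}_0)^T (\mathbf{d}^{(t)} - \mathbf{d}_0) \right|}{\|\mathbf{d} - \mathbf{d}_0\|_2}
\le \|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2
\tag{3.55}
$$

The first equality holds because $(\mathbf{d} - \mathbf{d}_0)^T \mathbf{d}_0 = 0$ for any two distributions, and by the Cauchy-Schwarz inequality both inequalities hold with equality when $\mathbf{d} = \mathbf{d}^{(t)}$.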
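The bound (3.55) can also be checked numerically. The following sketch, with hypothetical function names and randomly drawn distributions, confirms that the left-hand side never exceeds $\|\mathbf{d}^{(t)} - \mathbf{d}_0\|_2$ and reaches it at $\mathbf{d} = \mathbf{d}^{(t)}$.

```python
import math
import random

def norm2_edge_term(d_t, d):
    """sum_n d_t[n] * (d[n] - 1/N) / ||d - d0||_2, the quantity bounded in Eq. (3.55)."""
    n = len(d)
    diff = [dn - 1.0 / n for dn in d]
    norm = math.sqrt(sum(v * v for v in diff))
    return sum(a * b for a, b in zip(d_t, diff)) / norm

def distance_to_center(d):
    n = len(d)
    return math.sqrt(sum((dn - 1.0 / n) ** 2 for dn in d))

def random_distribution(n, rng):
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(0)
d_t = random_distribution(8, rng)
bound = distance_to_center(d_t)
worst_gap = min(bound - norm2_edge_term(d_t, random_distribution(8, rng))
                for _ in range(1000))
print(worst_gap >= -1e-12)                               # the bound is never violated
print(abs(norm2_edge_term(d_t, d_t) - bound) < 1e-12)    # equality at d = d_t
```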

Based on the above discussions, we have shown that AdaBoost_KL and AdaBoost_norm2 are algorithms performing a stage-wise gradient descent in the hypothesis space $\tilde{\mathcal{H}}$ instead of $\mathcal{H}$ to minimize the cost function of the soft margin distributions, where $\tilde{\mathcal{H}}$ encodes the distribution skewness information with respect to the center distribution into the hypothesis decisions. However, in both algorithms, we do not need to construct $\tilde{\mathcal{H}}$ explicitly. The same technique can also be used for AdaBoost_Reg to construct a virtual hypothesis space. However, AdaBoost_Reg cannot be explained as a gradient descent algorithm in that space, since the direction the algorithm searches for in each iteration is not the one that maximally decreases the associated cost function.

Table 3.1: Summary of 13 datasets

            No. Features   No. Train   No. Test   No. Realizations
Banana            2            400       4900          100
Bcancer           9            200         77          100
Diabetis          8            468        300          100
German           20            700        300          100
Heart            13            170        100          100
Ringnorm         20            400       7000          100
Fsolar            9            666        400          100
Thyroid           5            140         75          100
Titanic           3            150       2051          100
Waveform         21            400       4600          100
Splice           60           1000       2175           20
Twonorm          20            400        700          100
Image            18           1300       1010           20

3.4  Experimental  Results 

3.4.1  Experiments  Using  UCI,  DELVE  and  STATLOG  Datasets 

To demonstrate the effectiveness of the two newly proposed algorithms, a large-scale experiment is conducted and the results are compared with the AdaBoost_Reg, ν-Arc and C-Barrier algorithms.

For the sake of a fair comparison, the experimental setup herein is the same as that used in [63]. Thanks to Ratsch’s effort, the detailed information about the experimental setup as well as the benchmark datasets can be found at http://mlg.anu.edu.au/raetsch/data/index.html. We use 13 artificial and real-world data sets originally from the UCI [6], DELVE [1] and STATLOG benchmark repositories: banana, waveform, diabetis, breast cancer, ringnorm, twonorm, splice, heart, german, titanic, thyroid, flare solar and image. The detailed data information is summarized in Table 3.1. Each dataset is partitioned into 100 realizations of training and testing data (20 for splice and image, see Table 3.1). For each partition, a classifier is trained and the test error is computed.

The RBF (radial basis function) net is used as the weak learner. All of the RBF parameters are the same as those used in [63]. We use the cross-validation method based on the first five realizations of each dataset to estimate the parameter $\beta$ of the regularized AdaBoost algorithms. The maximum iteration number $T$ is chosen to be 200.

As an example, in Figure 3.1 we present some training and testing results and margin plots of three methods: AdaBoost, AdaBoost_norm2 and AdaBoost_KL, based on one realization of the waveform data. AdaBoost tries to maximize the margin of each pattern and hence it can reduce the training error to zero effectively. However, it quickly leads to overfitting. In contrast, AdaBoost_norm2 and AdaBoost_KL try to maximize the soft margin and allow some of the hardest examples to have a small margin. The two regularized methods can effectively alleviate the overfitting problem. To provide a more comprehensive comparison, in Table 3.2, the average classification results (standard deviations) over the 100 realizations of the 13 datasets are presented. Our observations are as follows:

• AdaBoost performs worse than a single RBF classifier in almost all cases. This is clearly due to the overfitting of AdaBoost. In ten (out of 13) cases AdaBoost_Reg performs significantly better than AdaBoost, and in ten cases AdaBoost_Reg performs better than a single RBF classifier.

• The results of AdaBoost_norm2 are slightly better than those of AdaBoost_Reg in eight cases, and except for two cases (heart and image), the results of AdaBoost_KL are better than those of AdaBoost_Reg. For a more rigorous comparison, a 90% significance test is reported in Table 3.3. For some data sets the performance differences are small (e.g. titanic). This is because AdaBoost_Reg is already a good classifier which was reported to be slightly better than the Support Vector Machine (RBF kernel) [63]. Nevertheless, significant improvements are observed for AdaBoost_KL in five datasets (out of 13) (Table 3.3).

• In Table 3.2, we also compare our algorithms with other regularized boosting algorithms, including ν-Arc and C-Barrier. Again, our algorithms perform better in most cases, which may be explained by the hard limited penalty function used in the supporting optimization scheme for the ν-Arc and C-Barrier algorithms.

• An interesting observation is that although AdaBoost is useful for the low noise data case, the results of ringnorm, thyroid and twonorm suggest that the regularized AdaBoost algorithms are effective even in the low noise regime. Moreover, in almost all cases, the standard deviations of our regularized AdaBoost algorithms are smaller than those of the single RBF and AdaBoost classifiers.

Figure 3.1 (panels (a) to (f), horizontal axes: iteration number): Training and testing results, and margin plots of three methods: AdaBoost, AdaBoost_norm2 and AdaBoost_KL based on the waveform data. AdaBoost quickly leads to overfitting while the regularized methods effectively alleviate this problem.

Table 3.2: Classification errors (standard deviation) of the eight methods. The best classifiers are marked in boldface.

            RBF         AB          AB_R        AB_KL
Banana      10.8±0.6    12.3±0.7    10.9±0.4    10.7±0.4
Bcancer     27.6±4.7    30.4±4.7    26.5±4.5    26.1±4.4
Diabetis    24.3±1.9    26.5±2.3    23.8±1.8    23.5±1.8
German      24.7±2.4    27.5±2.5    24.3±2.1    24.2±2.2
Heart       17.6±3.3    20.3±3.4    16.5±3.5    16.9±3.2
Ringnorm     1.7±0.2     1.9±0.3     1.6±0.1     1.5±0.1
Fsolar      34.4±2.0    35.7±1.8    34.2±2.2    34.1±1.6
Thyroid      4.5±2.1     4.4±2.2     4.6±2.2     4.3±1.9
Titanic     23.3±1.3    22.6±1.2    22.6±1.2    22.5±0.9
Waveform    10.7±1.1    10.8±0.6     9.8±0.8     9.4±0.6
Splice      10.0±1.0    10.1±0.5     9.5±0.7     9.2±0.6
Image        3.3±0.6     2.7±0.7     2.7±0.6     2.7±0.6
Twonorm      2.9±0.3     3.0±0.3     2.7±0.2     2.6±0.2

            AB_norm2    ν-Arc       C-Bar       SVM
Banana      10.6±0.4    10.8±0.5    10.9±0.5    11.5±0.7
Bcancer     26.0±4.4    25.8±4.6    25.9±4.4    26.0±4.7
Diabetis    23.6±1.8    23.7±2.0    23.7±1.8    23.5±1.7
German      24.1±2.2    24.4±2.2    24.3±2.4    23.6±2.1
Heart       17.0±3.1    16.5±3.5    17.0±3.4    16.0±3.3
Ringnorm     1.6±0.1     1.7±0.2     1.7±0.2     1.7±0.1
Fsolar      34.1±1.7    34.4±1.9    33.7±1.9    32.4±1.8
Thyroid      4.4±2.2     4.4±2.2     4.5±2.2     4.8±2.2
Titanic     22.5±1.2    23.0±1.4    22.4±1.1    22.4±1.0
Waveform     9.5±0.4    10.0±0.7     9.7±0.5     9.9±0.4
Splice       9.5±0.5    N/A         N/A         10.9±0.7
Image        2.7±0.5    N/A         N/A          3.0±0.6
Twonorm      2.7±0.2    N/A         N/A          3.0±0.2

Table 3.3: 90% significance test comparing AdaBoost_KL with other methods. ‘0’ means the test accepts the null hypothesis: no significant difference in average performance. ‘+’ means the test accepts the alternative hypothesis: AdaBoost_KL is significantly better, and ‘-’ means AdaBoost_KL is significantly worse.

Figure 3.2: USPS database of handwritten digits “3” (top row) and “5” (bottom row) - a pair of classes that is the most confused among the ten digits.


3.4.2  Experiments  Using  USPS  Dataset 

In this experiment, we test the robustness of our regularized AdaBoost algorithms against label noise and feature noise. Since the two regularized algorithms gave similar performances in the last experiment, we herein only focus on AdaBoost_KL.

We choose to validate our approach on the problem of recognizing handwritten digits. More specifically, we present classification results using AdaBoost and AdaBoost_KL on the benchmark USPS database, which comprises 16 x 16 gray-scale images of handwritten digits, as illustrated in Figure 3.2.

Considering multiclass AdaBoost, as required for the recognition of all ten digits, would distract from the focus of this study. Therefore, having carefully examined the USPS database, we choose only two classes - namely, “3” and “5” - which, in fact, form the most confused pair among the ten digits. Thus, our training and testing datasets contain 1214 and 326 images of the 3’s and 5’s, respectively.

We choose C4.5 as the base learner for the two-class AdaBoost and AdaBoost_KL. C4.5 is a decision-tree classifier with a long record of successful implementations in many classification systems [61]. Input to C4.5 is a set of labelled patterns, each consisting of a list of attribute/value pairs. In our case, each pattern (i.e., digit) has 256 attributes, which represent normalized pixel values in the interval [-1, 1] of the corresponding 16 x 16 gray-scale image.

As explained before, two main causes of noisy training data are mislabeling and feature (pixel) noise. Therefore, we carefully design two types of experiments that address the two problems. First, we explain our experiments with mislabeled data.

3.4.2.1 Mislabeling

To conduct experiments with mislabeling, we randomly select a certain number of training images and change their class labels. The mislabeling noise is introduced only to the training set, and not to the test set. Then, the ensemble classifiers of AdaBoost and AdaBoost_KL are evaluated for several different noise levels, ranging from 0% to 24% of mislabeled samples in the total number of the training images. The classification error on the test images, averaged over 15 runs, is reported in Figure 3.3. From the figure, we observe that AdaBoost_KL consistently outperforms AdaBoost, reducing the classification error by 15% to 40% at all noise levels. For example, for 9% and 24% of mislabeled data the error decreases by approximately 30% and 40%, respectively. Moreover, note that even at the 0% level of mislabeling noise, that is, when the original USPS dataset is used, AdaBoost_KL generates a smaller classification error than AdaBoost. This means that due to the indiscernible samples in the two classes (see Figure 3.2), the original USPS dataset is already corrupted by mislabeling noise, against which AdaBoost_KL proves effective.
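The mislabeling procedure described above can be sketched as follows. This is an illustrative sketch, not the code used in the experiments: the function name, the seeding, and the rounding of the number of flipped labels are assumptions, and labels are assumed to be coded as +1 and -1.

```python
import random

def flip_labels(labels, fraction, seed=0):
    """Return a copy of the +1/-1 labels with a given fraction flipped at random."""
    rng = random.Random(seed)
    n_flip = round(fraction * len(labels))
    flipped = rng.sample(range(len(labels)), n_flip)   # distinct indices
    noisy = list(labels)
    for i in flipped:
        noisy[i] = -noisy[i]
    return noisy

train_labels = [1] * 50 + [-1] * 50
noisy_labels = flip_labels(train_labels, fraction=0.24)
print(sum(a != b for a, b in zip(train_labels, noisy_labels)))  # number of flipped labels
```

Only the training labels would be passed through such a function; the test labels stay untouched, as stated above.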

Next, in Figure 3.4, we plot the learning curves of the two algorithms for different levels of mislabeling noise. From the figure, we observe that AdaBoost improves the classification of the weak learner C4.5, but when regularized, that is, for AdaBoost_KL, the classification performance becomes even better. Note that in all cases, for the given parameter $\beta$, the regularization accelerates the learning process. For example, for 9% of mislabeled data, it takes AdaBoost three times more iteration steps to reach the same performance as AdaBoost_KL.

The optimal $\beta$ values can be estimated through cross-validation on the training data. It is well-known that the cross-validation method usually results in an estimate with a large variance. Fortunately, this problem does not represent a serious concern, because the performance of AdaBoost_KL on test data is not very sensitive to the choice of $\beta$, as illustrated in Figure 3.5. There, we plot the classification error of AdaBoost_KL on the test images as a function of $\beta$, after 250 iteration steps, averaged over 15 runs, for various levels of mislabeling noise. From Figure 3.5, we observe that the error curve is flat around the minimum for a wide range of $\beta$ values. Therefore, the estimated $\beta$ in cross-validation can be considered optimal.

Figure 3.3: Classification error of AdaBoost and AdaBoost_KL on the test dataset, after 250 iterations, averaged over 15 runs, in the presence of mislabeling noise, ranging from 0% to 24% of the total number of the training images.

3.4.2.2 Feature Noise

For experiments with feature noise, we add Gaussian noise with zero mean and variance $\sigma^2$ to both training and test samples, as depicted in Figure 3.6. Here, we conduct two types of experiments. In the first, we add Gaussian noise to all the training and test images. The classification error on the test dataset, averaged over 10 runs, is plotted in Figure 3.7. Again, AdaBoost_KL consistently outperforms AdaBoost over all $\sigma$ values, yielding an approximately 20% decrease in classification error.

Figure 3.4 (panels: classification error for 0%, 6% and 24% of mislabeled data): Learning curves of AdaBoost and AdaBoost_KL in the presence of mislabeling noise. The use of regularization accelerates the learning process.

Figure 3.5: Classification error of AdaBoost_KL is not very sensitive to the choice of $\beta$. The error values are obtained after 250 iterations, and averaged over 15 runs.

Figure 3.6: Samples of the “3” (top) and “5” (bottom) classes corrupted by Gaussian noise with zero mean and $\sigma = \{0, 0.2, 0.5, 1\}$.

In the second experiment, a certain number of samples is first randomly selected from the training and test datasets. Then, Gaussian noise is added only to these images. We refer to the number of selected images as the noise level, measured as a percentage of the total number of samples in the corresponding sets. The results averaged over 10 runs, after 250 iterations, are presented in Figure 3.8. From the figure, we observe that AdaBoost_KL reduces the classification error of AdaBoost by approximately 10% to 25% over various noise levels and values of $\sigma$. The biggest gains are accomplished for small noise levels, where, for example, for a noise level of 5% and $\sigma = 2$ the decrease in error is approximately 25%.
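Both feature-noise settings (noise on all images, or on a randomly selected percentage) can be sketched with one helper. The function below is hypothetical and makes two assumptions not stated in the text: `sigma` is treated as the standard deviation of the noise, and noisy pixel values are not clipped back to [-1, 1].

```python
import random

def add_gaussian_noise(images, sigma, noise_level=1.0, seed=0):
    """Add zero-mean Gaussian noise to a randomly selected fraction of the images.

    images: list of pixel lists; sigma: assumed standard deviation of the noise;
    noise_level: fraction of images that receive noise (1.0 means all images).
    Returns the noisy copies and the set of indices that were corrupted.
    """
    rng = random.Random(seed)
    n_noisy = round(noise_level * len(images))
    chosen = set(rng.sample(range(len(images)), n_noisy))
    noisy = []
    for i, img in enumerate(images):
        if i in chosen:
            noisy.append([p + rng.gauss(0.0, sigma) for p in img])
        else:
            noisy.append(list(img))
    return noisy, chosen

images = [[0.0] * 256 for _ in range(20)]          # placeholder 16 x 16 images
noisy_all, _ = add_gaussian_noise(images, sigma=0.5)
noisy_some, corrupted = add_gaussian_noise(images, sigma=1.0, noise_level=0.05)
print(len(corrupted))                              # number of corrupted images
```

In the second setting the same helper would be applied separately to the training and the test set, since the noise level is measured relative to each set.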

Further, in Figure 3.9, we plot the learning curves of the two algorithms for different noise levels. Again, our algorithm, for the given parameter $\beta$, significantly accelerates the learning process. For example, for a noise level of 15% and $\sigma = 1$, it takes AdaBoost three times more iteration steps to reach the same performance as AdaBoost_KL. As for the experiments with mislabeled data, the optimal $\beta$ values can be estimated through cross-validation on the training data.

Figure 3.7 (feature noise introduced to 100% of the data): Classification error of AdaBoost and AdaBoost_KL on the test dataset, after 250 iterations, averaged over 10 runs, when Gaussian noise is added to all the samples.

Figure 3.8 (panels: noise level 5% and noise level 10%): Classification error of AdaBoost and AdaBoost_KL on the test dataset, after 250 iterations, averaged over 10 runs, when Gaussian noise is added to only a percentage of the test samples.

Figure 3.9 (panel: noise level 10%, sigma = 1): Learning curves of AdaBoost and AdaBoost_KL for feature noise introduced into part of the data.

3.5 Conclusions

We have made a detailed study of AdaBoost and its regularization variations. Two regularized AdaBoost algorithms have been proposed. By studying the connection between the minimax optimization problem and the maximum margin classification scheme, we show that the impressive generalization capability of AdaBoost in the low noise data cases may stem from the fact that the classification performance in the worst case is maximized. It also explains why the overfitting of AdaBoost is inevitable. It is natural to control the distribution skewness in the learning process to prevent the outlier samples from spoiling decision boundaries. We control the skewness by adding a convex penalty function to the objective of the minimax problem. Based on the generalized minimax theorem, we have shown that the penalty scheme can be pursued equivalently in the dual domain and that LPreg-AdaBoost is a special case of the penalty scheme with the penalty function being chosen as a hard limited function. By using two smooth convex penalty functions, two new soft margin concepts have been defined and thereby two new regularized
AdaBoost  algorithms  have  been  proposed.  The  regularization  is  naturally  incorpo- 
rated into  the  AdaBoost  process  adaptively.  To  demonstrate  the  effectiveness  of  the 
proposed  algorithms,  a large-scale  experiment  has  been  conducted.  Compared  with 
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AdaBoostj^gg,  u-Avc  and  C-Barrier,  our  methods  can  achieve  at  least  the  same  or 
much  better  performances. 

We have also performed a set of carefully designed experiments on the benchmark USPS database of handwritten digits to investigate the robustness of our regularized AdaBoost algorithm against feature and label noise. The results showed remarkable improvements over the original AdaBoost. In the presence of mislabeling noise, AdaBoostKL consistently decreases the classification error by approximately 40% at all noise levels. Moreover, for the different degrees of Gaussian noise added to the training images, AdaBoostKL outperforms AdaBoost by 10% to 25%. In addition, AdaBoostKL significantly accelerates the learning process, reaching the same performance as AdaBoost in approximately 2 to 3 times fewer iteration steps. We also found experimentally that the performance of AdaBoostKL is not very sensitive to the estimate of $\beta$, which makes model selection easy in real applications.

Unlike the commonly used regularization schemes, Equation (3.45) suggests a new idea of adaptive regularization. A similar idea may also be used in the regression scenario: training samples that consistently have large residual errors, and are therefore frequently visited by the base learning algorithm, should have less influence on the outcome of the final ensemble function. This is the subject of our future research.


CHAPTER  4 

LINEAR PROGRAMMING BASED ADABOOST

4.1  Introduction 

In the last chapter, we proposed two regularized AdaBoost algorithms, referred to as AdaBoostKL and AdaBoostnorm2. Although these algorithms performed very well in our experiments, a strict proof that they achieve the optimum solutions of the associated optimization problems is still missing, as it is for AdaBoost itself. This problem can be overcome by using column generation to obtain the query distributions and iteratively solve the problem in (3.25).

The column generation (CG) method [51] has been known since the 1950s in the field of numerical optimization and was first introduced into boosting by Demiriz et al. [19]. The basic idea of CG is that, instead of explicitly solving the optimization problem with the entire hypothesis set $\mathcal{H}$, CG restricts the dual problem by only considering the hypothesis functions generated so far and uses the base learning algorithm as an “oracle” to generate a new hypothesis or column. (It is for this reason that the method is called column generation.) This continues until there is no hypothesis function within $\mathcal{H}$ that violates the constraints. This method is known to converge for finite and infinite function sets. However, as we pointed out in the last chapter, CG usually exhibits slow convergence due to the degeneracy of (3.25). In particular, in the first few iterations, CG can produce a very sparse solution of $d$. A sparse solution implies that only a few data samples will participate in the learning process in the next iteration, which not only makes the learning difficult but also leads to slow convergence. One possible remedy is to constrain the possible query distribution to a restricted area at each iteration. From the computational aspect, we prefer the resulting optimization problem to be an LP problem. One simple yet effective approach is the BOXSTEP method [45], which constrains the current optimal solution to a box, measured by the norm-$\infty$ distance, centered on the previous solution. In this fashion, we transform the original complicated piecewise convex problem into a set of simple LP problems. The proposed algorithm, which we call LPnorm2-AdaBoost, can be shown to converge in a finite number of iterations. A large-scale experiment shows that the newly proposed algorithm achieves a slightly better overall classification performance than AdaBoostReg. In some cases, significant improvements are observed.

The remainder of the chapter is organized as follows. First, in Section 4.2 we give a brief review of the LPreg-AdaBoost algorithm and its CG based implementation. Then we propose a new algorithm, referred to as LPnorm2-AdaBoost. To validate the proposed algorithm, in Section 4.3 a large-scale experiment based on several artificial and real-world datasets is conducted and the results are reported. We finally conclude the chapter in Section 4.4 with our remarks.

4.2  Regularized  LP  Boosting  Algorithms 
4.2.1  LPreg-AdaBoost  (LPRA)  and  Column  Generation 

In  this  subsection,  we  describe  the  LPreg-AdaBoost  algorithm  and  its  column 
generation  based  implementation. 

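By constraining the distribution into a box, i.e., $d \le C\mathbf{1}$, we get the following optimization problem:

\[
\max_{\alpha \in \Gamma^{|\mathcal{H}|}} \; \min_{\{d \in \Gamma^{N},\, d \le C\mathbf{1}\}} \; d^{T} Z \alpha \tag{4.1}
\]

where $|\mathcal{H}|$ is the cardinality of the hypothesis function set, $C$ is a predefined parameter and $\mathbf{1} \in \mathbb{R}^{N}$ is a vector of all ones. After some mathematical manipulations, one can derive the primal optimization problem as:

\[
\begin{aligned}
\max_{(\rho, \lambda, \alpha)} \quad & \rho - C \sum_{n=1}^{N} \lambda_n \\
\text{subject to} \quad & \sum_{j=1}^{|\mathcal{H}|} \alpha_j z_{nj} \ge \rho - \lambda_n, \quad n = 1, \cdots, N \\
& \lambda_n \ge 0, \quad n = 1, \cdots, N \\
& \alpha \in \Gamma^{|\mathcal{H}|}
\end{aligned} \tag{4.2}
\]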

It is a linear programming problem which can be solved very efficiently by using an LP package. However, in practical applications, the cardinality of the hypothesis function set can be very large or even infinite. In other words, the gain matrix $Z$ does not exist in an explicit form and hence linear programming cannot be applied directly. This difficulty can be circumvented by using the column generation (CG) technique [19]. The basic idea of CG is to iteratively construct the optimal ensemble for a restricted subset of the hypothesis set, which is iteratively extended. In order to use the CG technique, we first transform (4.2) into its dual form:

\[
\begin{aligned}
\min_{(\gamma, d)} \quad & \gamma \\
\text{subject to} \quad & \sum_{n=1}^{N} d_n z_{nj} \le \gamma, \quad j = 1, \cdots, |\mathcal{H}| \\
& d \le C\mathbf{1}, \quad d \in \Gamma^{N}
\end{aligned} \tag{4.3}
\]

At the $t$-th iteration, the problem (4.3) is solved for a small subset of hypotheses $\mathcal{H}_t \subseteq \mathcal{H}$. Denote the optimal solutions of $d$ and $\gamma$ as $d^{(t+1)}$ and $\gamma^{(t+1)}$, respectively. Then the base learner is trained with respect to $d^{(t+1)}$ and a hypothesis function $h_{t+1}$ is found as:

\[
h_{t+1} = \arg\max_{h \in \mathcal{H}} \sum_{n=1}^{N} d_n^{(t+1)} h(x_n) y_n \tag{4.4}
\]

If the resulting $h_{t+1}$ violates the constraint $\sum_{n=1}^{N} d_n^{(t+1)} h_{t+1}(x_n) y_n \le \gamma^{(t+1)}$, then it is added to the restricted problem and the process is continued. This corresponds to generating a column in the gain matrix. If $h_{t+1}$ satisfies the constraint, then the current dual variables and the ensemble are optimal, as all constraints of the full master problem are fulfilled.
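The oracle step (4.4) and the stopping test can be sketched for a finite pool of candidate hypotheses. This is a minimal illustration, not the implementation used in the experiments: the function names, the dictionary of precomputed hypothesis outputs, and the tolerance `tol` are assumptions introduced here, and the LP that produces `d` and `gamma` is taken as given.

```python
def edge(d, h_outputs, y):
    """Weighted edge sum_n d_n * h(x_n) * y_n, the quantity maximized in (4.4)."""
    return sum(dn * hn * yn for dn, hn, yn in zip(d, h_outputs, y))

def oracle(d, candidates, y):
    """Pick the candidate with the largest edge under the distribution d.

    candidates: dict mapping a (hypothetical) hypothesis name to its list of
    outputs h(x_n) in {-1, +1} on the training set.
    """
    return max(candidates, key=lambda name: edge(d, candidates[name], y))

def column_generation_step(d, gamma, candidates, y, tol=1e-9):
    """Return (name, True) if the best candidate violates the dual constraint
    edge <= gamma and should be added as a new column, else (name, False)."""
    best = oracle(d, candidates, y)
    return best, edge(d, candidates[best], y) > gamma + tol
```

When the returned flag is `False`, no hypothesis in the pool violates the constraint, which is the optimality condition stated above.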

The CG method was first introduced into boosting by Demiriz et al. [19]. In their experiments, it was shown that, by using the commercial linear programming package CPLEX, the CG based LPreg-AdaBoost algorithm can achieve performance comparable to AdaBoost with respect to both classification quality and computation time.

4.2.2 LPnorm2-AdaBoost (LPNA)

As we have shown in the last chapter, a more general strategy for controlling the skewness of the distribution is to add a penalty term $P(d)$, which measures the distance between the query distribution and the distribution center, to the cost function of (3.1). This leads to the following optimization problem:

\[
\max_{\alpha \in \Gamma^{|\mathcal{H}|}} \; \min_{d \in \Gamma^{N}} \; d^{T} Z \alpha + \beta P(d) \tag{4.5}
\]

where $\beta$ is a predefined parameter controlling the trade-off between the distribution skewness and the training performance. As shown in Section 3.2.4, (4.5) includes LPRA as a special case.

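One commonly used distance metric is the $l_p$ norm. It is easy to show that $\|d - d_0\|_p$ is a convex function of $d$ and, following Lemma 3.2.1, we get the following regularized scheme in the dual domain:

\[
\begin{aligned}
\min_{(\gamma, d)} \quad & \gamma + \beta \|d - d_0\|_p \\
\text{subject to} \quad & \sum_{n=1}^{N} d_n z_{nj} \le \gamma, \quad j = 1, \cdots, |\mathcal{H}| \\
& d \in \Gamma^{N}
\end{aligned} \tag{4.6}
\]

In this chapter, we focus only on the case of the $l_2$ norm. However, the following derivation can be easily extended to other penalty functions. The above optimization problem can then be reformulated as:

\[
\begin{aligned}
\min_{(\gamma, d)} \quad & \gamma \\
\text{subject to} \quad & \sum_{n=1}^{N} d_n z_{nj} + \beta \|d - d_0\|_2 \le \gamma, \quad j = 1, \cdots, |\mathcal{H}| \\
& d \in \Gamma^{N}
\end{aligned} \tag{4.7}
\]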

Following the derivations in the last chapter, the convex optimization problem (4.7) can be linearly approximated as:

\[
\begin{aligned}
\min_{(\gamma, d)} \quad & \gamma \\
\text{subject to} \quad & z_t^{T} d \le \gamma, \quad t = 1, \cdots, T \\
& d \in \Gamma^{N}
\end{aligned} \tag{4.8}
\]

This is a linear programming problem that is easier to deal with than the original problem in (4.7). However, it is only a linear approximation, which gets better as more constraints are added. The query distributions can be obtained through the column generation technique and the finite convergence of the optimization problem (4.7) can be guaranteed. However, column generation usually shows a pattern of slow convergence due to the degeneracy of (4.8). In [19] column generation was used for implementing LPreg-AdaBoost. The problem of slow convergence was alleviated by setting a lower bound for each query distribution, i.e., $C_l \mathbf{1} \le d \le C\mathbf{1}$ with $C_l < C$, and consequently the possible query distributions are constrained to an even smaller area. In our case of $d \in \Gamma^{N}$, other more sophisticated stabilized column generation techniques [52] may be needed. Note that the problem of slow convergence, in particular in the first several iterations, is due to the sparseness of the optimum solution produced in each iteration, which not only makes the learning process difficult but also produces many unnecessary columns. The slow convergence problem is illustrated in Figure 4.1. The numbers in the circles give the order in which the columns or constraints are generated. Obviously, given columns 1 and 4, columns (constraints) 2 and 3 will not be activated. Based on the Karush-Kuhn-Tucker conditions, the corresponding hypothesis coefficients $\alpha_2$ and $\alpha_3$ are equal to zero. That is, the generation of $h_2$ and $h_3$ is not necessary. One natural idea to alleviate the problem is to constrain the solution to a box centered on the previous solution, also called the BOXSTEP method [45]:

\[
\begin{aligned}
\min_{(\gamma, d)} \quad & \gamma \\
\text{subject to} \quad & z_t^{T} d \le \gamma, \quad t = 1, \cdots, T \\
& d \in \Gamma^{N}, \quad \|d - d^{(t)}\|_{\infty} \le B
\end{aligned} \tag{4.9}
\]

where $B$ is a parameter defining the box size. The above optimization problem can be further simplified as an LP problem by noting that:

\[
\|d - d^{(t)}\|_{\infty} = \max_{n} |d(n) - d^{(t)}(n)| \le B \;\Longleftrightarrow\; d^{(t)} - \mathbf{1}B \le d \le \mathbf{1}B + d^{(t)} \tag{4.10}
\]


Figure 4.1: The slow convergence problem of column generation. The numbers in the circles give the order in which the columns or constraints are generated. Obviously, given columns 1 and 4, columns 2 and 3 will not be activated, i.e., the corresponding hypothesis coefficients $\alpha_2$ and $\alpha_3$ are equal to zero.

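Combining the constraint $d \in \Gamma^{N}$ with (4.10), we get:

\[
\max\{d^{(t)} - \mathbf{1}B,\, 0\} \le d \le \mathbf{1}B + d^{(t)} \tag{4.11}
\]

Then the problem becomes:

\[
\begin{aligned}
\min_{(\gamma, d)} \quad & \gamma \\
\text{subject to} \quad & z_t^{T} d \le \gamma, \quad t = 1, \cdots, T \\
& \max\{d^{(t)} - \mathbf{1}B,\, 0\} \le d \le \mathbf{1}B + d^{(t)}
\end{aligned} \tag{4.12}
\]

Based on (4.12), an LP based boosting algorithm, referred to as LPnorm2-AdaBoost (LPNA), can be formulated: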

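Initialization: $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, maximum iteration number $T$,
$d^{(1)}(n) = 1/N$, parameter $\beta$, box size $B$

for t = 1 : T

    1. Train base learner with respect to distribution $d^{(t)}$ and get hypothesis
       $h_t(x) : \mathcal{X} \to \{\pm 1\}$.

    2. Solve the optimization problem:

       \[
       \begin{aligned}
       (d^{*}, \gamma^{*}) = \arg\min_{(\gamma, d)} \quad & \gamma \\
       \text{subject to} \quad & z_j^{T} d \le \gamma, \quad j = 1, \cdots, t \\
       & \max\{d^{(t)} - \mathbf{1}B,\, 0\} \le d \le \mathbf{1}B + d^{(t)}
       \end{aligned} \tag{4.13}
       \]

    4. Update weights as $d^{(t+1)} = d^{*}$.

end

Output: $F(x) = \sum_{t=1}^{T} \alpha_t^{*} h_t(x)$,
where $\alpha^{*}$ are the Lagrangian multipliers from the last LP.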
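The box constraint used in step 2 reduces to simple per-sample variable bounds, which most LP solvers accept directly. The sketch below only computes and checks those bounds; the helper names `box_bounds` and `inside_box` and the tolerance are assumptions made for illustration, and no particular solver API is implied.

```python
def box_bounds(d_prev, box_size):
    """Per-sample (lower, upper) bounds from (4.11):
    max{d_prev - B, 0} <= d <= d_prev + B."""
    return [(max(dn - box_size, 0.0), dn + box_size) for dn in d_prev]

def inside_box(d, d_prev, box_size, tol=1e-12):
    """Check the equivalent form: d >= 0 and ||d - d_prev||_inf <= B."""
    nonnegative = all(dn >= -tol for dn in d)
    distance = max(abs(a - b) for a, b in zip(d, d_prev))
    return nonnegative and distance <= box_size + tol
```

With the uniform start $d^{(1)}(n) = 1/N$, any $B \ge 1/N$ makes the lower bound zero for every sample, so the box only limits how much weight a single sample can gain in one iteration.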


Directly  following  from  Theorem  4.5.17  and  Corollary  4.5.19  in  [52],  LPNA 
can  be  proven  to  converge  to  an  optimum  solution  after  a finite  number  of  iterations. 

We briefly discuss some implementation issues of LPNA. One important parameter of LPNA is the box size. If $B$ is chosen too large, the algorithm will still suffer from slow convergence; if $B$ is too small, the query distribution is not updated sufficiently in each iteration, which can also affect the convergence rate of the algorithm. In our experiments, we find that $B \in$ […] is appropriate, and in all our experiments we choose $B =$ […].

4.3 Experimental Results

To demonstrate the effectiveness of the newly proposed algorithm, a large-scale experiment is conducted and the results are compared with the RBF, AdaBoost, AdaBoostReg, SVM and LPRA algorithms.

We use 12 artificial and real-world datasets originally from the UCI, DELVE and STATLOG benchmark repositories: banana, waveform, diabetis, breast cancer, ringnorm, twonorm, splice, heart, german, titanic, thyroid and flare solar. These datasets were used in the last chapter. We drop the image dataset due to its large problem size.

The RBF net is used as the base learner. All of the RBF parameters, including the number of RBF centers and the iteration number, are the same as those used in [63]. These parameters, as well as the regularization parameter $\beta$, are tuned via cross-validation. Throughout the experiment, the maximum iteration number of LPNA is chosen to be 150. We use the commercial optimization package XPRESS as the LP solver.

We  present  several  examples  to  illustrate  the  proposed  algorithm. 

1. The feature dimensionality of the banana dataset is 2. We plot the decision boundaries of RBF, AdaBoost and LPNA in the feature space in Figure 4.2. AdaBoost tries to classify each pattern according to its associated label and forms a zigzag decision boundary, which gives a straightforward illustration of the overfitting phenomenon of AdaBoost. Both RBF and LPNA give smooth and similar decision boundaries. It is difficult to determine visually which decision boundary is better, or where the decision boundary of RBF could be further improved given these training samples. This indicates that RBF with well-tuned structural parameters is a strong classifier. It is not surprising that boosting such a strong classifier without any regularization leads to severe overfitting. For this particular realization, the testing errors of RBF, AdaBoost and LPNA are 10.3%, 11.8% and 9.9%, respectively. The small visual differences in the decision boundaries of RBF and LPNA are also reflected in the average classification results (Table 4.1). However, LPNA reduces the standard deviation of the RBF results by about 30%.

2. In the second example, we present some training and testing results and margin plots of AdaBoost and LPNA based on one realization of the waveform data in Figure 4.3. AdaBoost tries to maximize the margin of each pattern and hence can reduce the training error to zero effectively. However, it quickly leads to overfitting. In contrast, LPNA tries to maximize the soft margin and allows some of the hardest examples to have a small margin. LPNA can effectively alleviate the overfitting problem.
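The margin plots mentioned in the second example rest on a per-pattern margin. A minimal sketch of how such margins could be computed is given below, assuming the usual norm-1 normalized margin of a voting ensemble; the function name and argument layout are illustrative and not taken from the experiments.

```python
def normalized_margins(alphas, hypothesis_outputs, y):
    """Margin of each training pattern:
    y_n * sum_t alpha_t h_t(x_n) / sum_t alpha_t.

    alphas             : nonnegative combination coefficients, one per hypothesis
    hypothesis_outputs : hypothesis_outputs[t][n] = h_t(x_n) in {-1, +1}
    y                  : labels in {-1, +1}
    """
    total = sum(alphas)
    margins = []
    for n, yn in enumerate(y):
        score = sum(a * h[n] for a, h in zip(alphas, hypothesis_outputs))
        margins.append(yn * score / total)
    return margins
```

A pattern with a positive margin is classified correctly by the ensemble; a soft-margin method tolerates a few small or negative values here instead of forcing every margin up.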

To provide a more comprehensive comparison, in Table 4.1, the average classification results (standard deviations) over the 100 realizations of each dataset are presented. We note that:

1. AdaBoost performs worse than a single RBF classifier in almost all cases. This is due to the overfitting of AdaBoost. In many cases AdaBoost overfits after only a few iterations, which clearly indicates that regularization is needed for AdaBoost.

2. LPNA can significantly improve the performance of RBF while avoiding the overfitting problem. In 10 (out of 12) cases, LPNA performs better than AdaBoost and in 10 cases LPNA performs better than a single RBF classifier.

3. We compare our algorithm with AdaBoostReg, which is considered to have the best performance on the same datasets. For some datasets, the performance differences are small (e.g., titanic, twonorm). This is because AdaBoostReg is already a good classifier, which was reported to slightly outperform SVM [63]. Nevertheless, significant improvements are observed on some datasets (e.g., waveform). An overall comparison between the six classifiers is also presented, based on the comparison method used in [63]. The last line in the table, “Mean%”, is computed as follows: for each dataset the average error rates of the six classifiers are divided by the minimum error rate for that dataset and then 1 is subtracted. The resulting numbers are averaged and the variance is computed. Both LPNA and AdaBoostReg perform better than SVM.

4. We also make a comparison between LPRA and LPNA. Theoretically speaking, by choosing $C = 1/N$, LPRA returns one classifier which is the same as the one found in the first iteration. However, this makes the boosting meaningless. Therefore, we search for the best $C$ value in $[1/(0.9N), 1/(0.05N)]$. Although LPRA can achieve better performance than AdaBoost, in almost all cases LPRA is inferior to LPNA. This may be explained by the hard-limited penalty function used in LPRA.

5. The proposed algorithm can be readily extended to the case where the base learner produces a soft decision. The soft decision of RBF is computed as follows:

where $z \in \mathbb{R}$ is the output of RBF given the pattern $x$. The classification results of LPNA using soft information (LPNA(S)) are reported in Table 4.2. By exploiting the confidence information, LPNA(S) performs slightly better than LPNA with hard decisions (LPNA(H)). Moreover, in the experiment, we find that LPNA(S) usually converges faster than LPNA(H) in terms of the classification performance.

Table 4.1: Classification errors (standard deviations) of the six methods: RBF, AdaBoost (AB), AdaBoostReg (ABR), LPreg-AdaBoost (LPRA), LPnorm2-AdaBoost (LPNA) and SVM. The best classifiers are marked in boldface.

            RBF [63]    AB [63]     ABR [63]    LPNA        LPRA        SVM [63]
Banana      10.8±0.6    12.3±0.7    10.9±0.4    10.7±0.4    10.9±0.9    11.5±0.7
Bcancer     27.6±4.7    30.4±4.7    26.5±4.5    25.9±4.5    26.7±4.7    26.0±4.7
Diabetis    24.3±1.9    26.5±2.3    23.8±1.8    23.8±1.8    24.3±2.0    23.5±1.7
German      24.7±2.4    27.5±2.5    24.3±2.1    23.9±2.3    24.5±2.3    23.6±2.1
Heart       17.6±3.3    20.3±3.4    16.5±3.5    16.9±3.2    17.5±3.6    16.0±3.3
Ringnorm     1.7±0.2     1.9±0.3     1.6±0.1     1.6±0.2     1.7±0.2     1.7±0.1
Fsolar      34.4±2.0    35.7±1.8    34.2±2.2    34.3±1.8    34.6±2.0    32.4±1.8
Thyroid      4.5±2.1     4.4±2.2     4.6±2.2     4.3±2.2     4.4±2.1     4.8±2.2
Titanic     23.3±1.3    22.6±1.2    22.6±1.2    22.5±1.1    23.3±0.9    22.4±1.0
Waveform    10.7±1.1    10.8±0.6     9.8±0.8     9.4±0.4     9.8±0.5     9.9±0.4
Splice      10.0±1.0    10.1±0.5     9.5±0.7     9.4±0.7     9.5±0.6    10.9±0.7
Twonorm      2.9±0.3     3.0±0.3     2.7±0.2     2.7±0.2     2.8±0.2     3.0±0.2
Mean %       5.9±3.0    13.3±7.4     2.4±2.1     1.6±2.3     4.8±2.8     4.7±5.6
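The “Mean%” row of Table 4.1 is described in the text as a relative error averaged over datasets. A minimal sketch of that computation follows. The function name, the nested-dictionary input, and the use of the population variance are assumptions made here; the text does not say which variance estimator was used, and applying this to the rounded table entries need not reproduce the printed row.

```python
from statistics import mean, pvariance

def mean_percent(errors):
    """errors[dataset][classifier] = average error rate (in %).

    For each dataset, divide every classifier's error by the smallest error
    on that dataset and subtract 1; then summarize the resulting percentages
    over datasets. Returns {classifier: (mean, variance)} in percent.
    """
    classifiers = list(next(iter(errors.values())))
    relative = {c: [] for c in classifiers}
    for row in errors.values():
        best = min(row.values())
        for c in classifiers:
            relative[c].append(100.0 * (row[c] / best - 1.0))
    return {c: (mean(v), pvariance(v)) for c, v in relative.items()}

if __name__ == "__main__":
    # Two rows of Table 4.1, three of the six classifiers, for illustration.
    subset = {
        "Banana": {"RBF": 10.8, "LPNA": 10.7, "SVM": 11.5},
        "Waveform": {"RBF": 10.7, "LPNA": 9.4, "SVM": 9.9},
    }
    for name, (avg, var) in mean_percent(subset).items():
        print(f"{name}: mean {avg:.2f}%, variance {var:.2f}")
```

A classifier that is best on every dataset scores 0 under this measure, so smaller values indicate performance closer to the per-dataset optimum.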

4.4  Conclusions 

In this chapter, we have proposed a new regularized AdaBoost algorithm, which was directly motivated by the connection between AdaBoost and linear programming. The algorithm is based on an intuitive idea of controlling the distribution skewness in the learning process to prevent the outlier samples from spoiling decision boundaries by introducing a smooth convex penalty function into the objective of the minimax problem. We used the stabilized column generation technique to transform the optimization problem into a simple linear programming problem. A large-scale experiment showed that the newly proposed algorithm can effectively alleviate the overfitting problem and can achieve a slightly better overall classification performance than AdaBoostReg. Moreover, our algorithm has a clear underlying optimization scheme and can be proven to converge in a finite number of iterations.

Figure 4.2: The decision boundaries of three methods: RBF, AdaBoost and LPNA based on one realization of the banana data. AdaBoost tries to classify each pattern according to its associated label and forms a zigzag decision boundary, which gives a straightforward illustration of the overfitting phenomenon of AdaBoost. Both RBF and LPNA give smooth and similar decision boundaries. (Surviving panel title: Decision Boundary (LPNA).)

Figure 4.3: Training and testing results, and margin plots of two methods: AdaBoost and LPNA based on one realization of the waveform data. AdaBoost can effectively reduce the training error to zero but leads to overfitting after only a few iterations. The regularized method effectively alleviates the overfitting problem. (Panels: (a) Results (LPNA); (b) Margin (LPNA); (c) Results (AdaBoost); (d) Margin (AdaBoost).)

Table 4.2: Classification errors (standard deviations) of LPNA(H) and LPNA(S), which use hard and soft decisions, respectively.

            LPNA(H)     LPNA(S)
Banana      10.7±0.4    10.6±0.4
Bcancer     25.9±4.5    26.5±4.6
Diabetis    23.8±1.8    23.8±1.8
German      23.9±2.3    24.0±2.2
Heart       16.9±3.2    17.0±3.1
Ringnorm     1.6±0.2     1.6±0.2
Fsolar      34.3±1.8    34.2±2.0
Thyroid      4.3±2.2     4.2±2.2
Titanic     22.5±1.1    22.5±1.1
Waveform     9.4±0.4     9.3±0.4
Splice       9.4±0.7     9.3±0.7
Twonorm      2.7±0.2     2.7±0.2

Figure 4.4: Testing results of LPNA(H) and LPNA(S) averaged over 100 realizations of the german and waveform datasets. LPNA(S) converges faster than LPNA(H) in terms of the classification performance. (Panels: (a) Testing Results (german); (b) Testing Results (waveform).)


CHAPTER  5 

REGULARIZED  MULTICLASS  ADABOOST 

5.1  Introduction 

In Chapters 3 and 4, we developed several regularized AdaBoost algorithms for binary classification problems. In this chapter we extend our derivations to multiclass problems.

A multiclass classifier assigns class labels to patterns, where labels are drawn from a finite set of classes $\mathcal{Y} = \{1, \cdots, C\}$ with $C > 2$. For example, the goal of an optical character recognition system is to determine the digit value in $\{0, \cdots, 9\}$ from its image. In designing a classifier, it is usually easier to devise algorithms for distinguishing between two classes. Some binary classification algorithms, such as C4.5 and neural networks, can be naturally extended to multiclass problems, while for other algorithms the extension is not straightforward. One commonly used approach is to first decompose a multiclass problem into several binary problems based on a code matrix in which each class is represented by a code word. One simple example is the “One against All” coding scheme, where $C$ binary classifiers are trained, each distinguishing one class from the other classes, and $C$ is the number of classes. Hastie and Tibshirani [34] suggested a different approach in which each classifier is trained on one pair of classes and simply ignores the others. This is the so-called “All Pair” scheme. Dietterich and Bakiri [22] presented a general framework in which the code matrix is constructed using error-correcting codes (ECC). For all of these methods, after the binary classification problems have been solved, the resulting set of binary classifiers must be combined in some way.

Therefore, dealing with a multiclass problem involves, in addition to classifier training, the design of a code matrix. Crammer and Singer [18] summarized the approaches to a multiclass problem into three categories:

(1)  Given  a set  of  binary  classifiers,  find  a code  matrix  which  has  small  empirical 
loss; 

(2)  Given  a code  matrix,  find  a set  of  binary  classifiers  which  has  small  empirical 
loss; 

(3)  Find  a set  of  binary  classifiers  and  a code  matrix  simultaneously  which  has 
small  empirical  loss. 

We  are  not  interested  in  the  first  category  since  we  usually  do  not  know  binary 
classifiers  in  advance.  Most  of  the  existing  algorithms  belong  to  the  second  one.  The 
main  drawback  of  these  approaches  is  that  a predefined  code  matrix  fails  to  address 
the  dependence  between  a code  matrix  and  the  class  of  hypothesis  functions  used 
to  construct  the  binary  classifiers  as  well  as  the  specific  application.  In  this  sense, 
finding  the  binary  classifiers  and  a code  matrix  simultaneously  appears  to  be  the 
best  choice.  Unfortunately,  this  problem  has  been  shown  to  be  NP-hard.  A question 
arises  naturally:  can  we  find  a way  which  overcomes  the  drawback  of  a predefined 
code  matrix  while  avoiding  the  need  to  solve  an  NP-hard  problem? 
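For reference, the two predefined code matrices mentioned earlier in this section can be written down explicitly. The sketch below is illustrative only: the function names are ours, and the use of a 0 entry for the classes an “All Pair” classifier ignores is a common convention assumed here rather than something stated in the text.

```python
from itertools import combinations

def one_against_all(num_classes):
    """C x C matrix: column l separates class l (+1) from all others (-1)."""
    return [[1 if c == l else -1 for l in range(num_classes)]
            for c in range(num_classes)]

def all_pairs(num_classes):
    """C x C(C-1)/2 matrix: column (i, j) has +1 for class i, -1 for class j,
    and 0 for the classes that the corresponding binary classifier ignores."""
    pairs = list(combinations(range(num_classes), 2))
    return [[1 if c == i else -1 if c == j else 0 for (i, j) in pairs]
            for c in range(num_classes)]
```

Each row is the code word of one class, so for 4 classes `one_against_all(4)` has code words of length 4 and `all_pairs(4)` has code words of length 6.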

In this chapter, we are particularly interested in using AdaBoost for multiclass problems. We investigate two multiclass AdaBoost algorithms: (1) AdaBoost.MO, which uses a predefined code matrix, and (2) AdaBoost.ECC, which alternately generates columns of a code matrix and hypothesis functions. In this sense, AdaBoost.ECC can be considered as an alternative approach to exploiting the underlying dependence between a code matrix and the hypothesis functions under consideration. Therefore, some in-depth theoretical studies of AdaBoost.ECC are warranted.

As we have seen in the previous chapters, the performance of AdaBoost can degrade dramatically in the presence of noise. Very little work has been done to improve the robustness of multiclass AdaBoost against noise. In this chapter, we extend our derivations to multiclass problems and propose two new regularized algorithms based on AdaBoost.MO and AdaBoost.ECC.

5.2 AdaBoost.MO and Regularized AdaBoost.MO
5.2.1 AdaBoost.MO

Suppose we have a training dataset $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \in \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$
is the pattern space and $\mathcal{Y} = \{1, \dots, C\}$ is the label space. We assume that a
code matrix $M \in \{\pm 1\}^{C \times L}$ is given, where $L$ is the length of a code word. For
example, for two commonly used code schemes, "One against All" and "All Pair", $L$
is equal to $C$ and $C(C-1)/2$, respectively. AdaBoost.MO [67] works in rounds. In each
iteration $t = 1, \dots, T$, a distribution $D^{(t)} \in \Gamma$ is generated over pairs of training
examples and columns of the matrix $M$. (Here we denote $\Gamma$ as a distribution simplex,
$\Gamma = \{D(n,l) : \sum_{n=1}^{N}\sum_{l=1}^{L} D(n,l) = 1,\ D(n,l) \ge 0,\ n = 1, \dots, N,\ l = 1, \dots, L\}$.)
A set of base learners $\{h_t^{(l)}\}_{l=1}^{L}$ is trained with respect to the distribution $D^{(t)}$
based on the binary partition defined by each column of $M$. Then the error $\epsilon_t$ of
$\{h_t^{(l)}\}_{l=1}^{L}$ is computed:

$$\epsilon_t = \sum_{n=1}^{N}\sum_{l=1}^{L} D^{(t)}(n,l)\, I\big(M(y_n,l) \neq h_t^{(l)}(\mathbf{x}_n)\big) \tag{5.1}$$

Analogous to the two-class AdaBoost, the combination coefficient $\hat{\alpha}_t$ is computed as:

$$\hat{\alpha}_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} \tag{5.2}$$

Then the distribution is updated as:

$$D^{(t+1)}(n,l) = D^{(t)}(n,l)\exp\big(-\hat{\alpha}_t M(y_n,l)\, h_t^{(l)}(\mathbf{x}_n)\big)/C_t \tag{5.3}$$

where $C_t$ is a normalization constant such that $D^{(t+1)}$ is a distribution.

After $T$ rounds, this procedure outputs a final classifier:

$$\mathbf{F}(\mathbf{x}) = \big[F^{(1)}(\mathbf{x}), \dots, F^{(L)}(\mathbf{x})\big] \tag{5.4}$$

where $F^{(l)}(\mathbf{x}) = \sum_{t=1}^{T}\hat{\alpha}_t h_t^{(l)}(\mathbf{x})$. When presented with an unseen sample $\mathbf{x}$,
AdaBoost.MO predicts the label $y^*$ whose row $M(y^*)$ is "closest" to the prediction
$\mathbf{F}(\mathbf{x})$ based on a decoding strategy. Two typical decoding strategies are the Hamming
distance decoding:

$$y^* = \arg\min_{y \in \mathcal{Y}} \sum_{l=1}^{L} \frac{1 - \mathrm{sign}\big(M(y,l)F^{(l)}(\mathbf{x})\big)}{2} \tag{5.5}$$

and the loss based decoding:

$$y^* = \arg\min_{y \in \mathcal{Y}} \sum_{l=1}^{L} \exp\big(-M(y,l)F^{(l)}(\mathbf{x})\big) \tag{5.6}$$
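The two decoding rules (5.5) and (5.6) can be sketched in a few lines. This is only an illustration: the code matrix, the ensemble outputs, and the 0-based class indices below are made-up assumptions, not values from the experiments.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def hamming_decode(M, F):
    """Index of the row of M minimizing sum_l (1 - sign(M[y][l] * F[l])) / 2, as in (5.5)."""
    costs = [sum((1 - sign(m * f)) / 2 for m, f in zip(row, F)) for row in M]
    return min(range(len(M)), key=costs.__getitem__)

def loss_decode(M, F):
    """Index of the row of M minimizing sum_l exp(-M[y][l] * F[l]), as in (5.6)."""
    costs = [sum(math.exp(-m * f) for m, f in zip(row, F)) for row in M]
    return min(range(len(M)), key=costs.__getitem__)

# Hypothetical "One against All" code matrix for C = 3 classes (L = 3).
M = [[+1, -1, -1],
     [-1, +1, -1],
     [-1, -1, +1]]
F = [2.0, 0.1, -3.0]   # made-up ensemble outputs F^(1)(x), F^(2)(x), F^(3)(x)

print(hamming_decode(M, F), loss_decode(M, F))
```

In this example the first two rows have the same Hamming cost, so the Hamming rule can only break the tie by row order, whereas the loss based rule also uses the magnitudes of the outputs.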

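AdaBoost.MO is summarized as follows:


AdaBoost.MO

Initialization: $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \in \mathcal{X} \times \mathcal{Y}$, maximum iteration number $T$,
$D^{(1)}(n,l) = 1/(NL)$.

for t = 1 : T

1. Train weak learner with respect to distribution $D^{(t)}$ and get a set of
hypotheses $\{h_t^{(l)}(\mathbf{x})\}_{l=1}^{L}$.

2. Calculate the weighted training error $\epsilon_t$ of $\{h_t^{(l)}(\mathbf{x})\}_{l=1}^{L}$:

$$\epsilon_t = \sum_{n=1}^{N}\sum_{l=1}^{L} D^{(t)}(n,l)\, I\big(M(y_n,l) \neq h_t^{(l)}(\mathbf{x}_n)\big) \tag{5.7}$$

where $I(\cdot)$ is the indicator function.

3. Compute the combination coefficient:

$$\hat{\alpha}_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} \tag{5.8}$$

4. Update weights:

$$D^{(t+1)}(n,l) = D^{(t)}(n,l)\exp\big(-\hat{\alpha}_t M(y_n,l)\, h_t^{(l)}(\mathbf{x}_n)\big)/C_t \tag{5.9}$$

where $C_t$ is a normalization constant such that $D^{(t+1)}$ is a distribution.

end

Output: $\mathbf{F}(\mathbf{x}) = \big[F^{(1)}(\mathbf{x}), \dots, F^{(L)}(\mathbf{x})\big]$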
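The loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the weak learner is a hypothetical one-dimensional threshold stump, the toy data and the two-column code matrix are made up, the error is clamped away from 0 and 1 to keep the logarithm finite, and classes are indexed from 0.

```python
import math

def train_stump(xs, targets, weights):
    """Best 1-D threshold stump h(x) = s if x > thr else -s under the given weights."""
    best = None
    for thr in sorted(set(xs)):
        for s in (+1, -1):
            err = sum(w for x, y, w in zip(xs, targets, weights)
                      if (s if x > thr else -s) != y)
            if best is None or err < best[0]:
                best = (err, thr, s)
    _, thr, s = best
    return lambda x, thr=thr, s=s: s if x > thr else -s

def adaboost_mo(xs, ys, M, T):
    N, L = len(xs), len(M[0])
    D = [[1.0 / (N * L)] * L for _ in range(N)]        # D^(1)(n, l) = 1/(NL)
    ensemble = []                                      # list of (alpha_t, [h_t^(l)])
    for _ in range(T):
        hs = []
        for l in range(L):                             # one binary problem per column
            col_w = [D[n][l] for n in range(N)]
            col_y = [M[ys[n]][l] for n in range(N)]
            hs.append(train_stump(xs, col_y, col_w))
        eps = sum(D[n][l] for n in range(N) for l in range(L)
                  if hs[l](xs[n]) != M[ys[n]][l])      # weighted error (5.7)
        eps = min(max(eps, 1e-10), 1 - 1e-10)          # guard the logarithm (assumption)
        alpha = 0.5 * math.log((1 - eps) / eps)        # coefficient (5.8)
        for n in range(N):
            for l in range(L):
                D[n][l] *= math.exp(-alpha * M[ys[n]][l] * hs[l](xs[n]))  # update (5.9)
        Z = sum(map(sum, D))                           # normalization constant C_t
        D = [[d / Z for d in row] for row in D]
        ensemble.append((alpha, hs))
    return ensemble

def predict(ensemble, M, x):
    """Loss based decoding (5.6) of F(x) = [F^(1)(x), ..., F^(L)(x)]."""
    L = len(M[0])
    F = [sum(a * hs[l](x) for a, hs in ensemble) for l in range(L)]
    costs = [sum(math.exp(-m * f) for m, f in zip(row, F)) for row in M]
    return min(range(len(M)), key=costs.__getitem__)

# Toy data: three classes on a line, with a made-up code matrix whose columns
# are each separable by a single threshold.
xs = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2, 2.0, 2.1, 2.2]
ys = [0, 0, 0, 1, 1, 1, 2, 2, 2]
M = [[+1, +1], [-1, +1], [-1, -1]]
ensemble = adaboost_mo(xs, ys, M, T=3)
print([predict(ensemble, M, x) for x in xs])
```

The toy problem is deliberately easy so that the behaviour is predictable. With a harder code matrix, the same loop reweights the (sample, column) pairs that the current hypotheses get wrong.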


In the following, we prove that AdaBoost.MO implements a stage-wise functional
gradient descent procedure.

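Theorem 5.2.1 AdaBoost.MO implements a stage-wise functional gradient descent
procedure on the cost function:

$$E = \sum_{l=1}^{L}\sum_{n=1}^{N}\exp\Big(-\sum_{t}\hat{\alpha}_t h_t^{(l)}(\mathbf{x}_n)M(y_n,l)\Big) = \sum_{l=1}^{L}\sum_{n=1}^{N}\exp\big(-F^{(l)}(\mathbf{x}_n)M(y_n,l)\big) \tag{5.10}$$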

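Proof:

The proof is a direct extension of its counterpart for the two-class AdaBoost. In the
$t$-th iteration, define the negative functional gradient of $E$ with respect to $F_{t-1}^{(l)}(\mathbf{x})$ as:

$$-\nabla E\big(F_{t-1}^{(l)}\big)(\mathbf{x}) = \begin{cases} 0 & \text{if } \mathbf{x} \neq \mathbf{x}_n \\ M(y_n,l)\exp\big(-M(y_n,l)F_{t-1}^{(l)}(\mathbf{x}_n)\big) & \text{if } \mathbf{x} = \mathbf{x}_n,\ n = 1, \dots, N \end{cases} \tag{5.11}$$

which is the direction in which the cost function most rapidly decreases. Since we
are restricted to choosing our new function from $\mathcal{H}$, we search for an $h_t^{(l)} \in \mathcal{H}$ such
that the inner product:

$$\begin{aligned}
\big\langle -\nabla E, h_t^{(l)} \big\rangle &= \sum_{n=1}^{N}\exp\big(-M(y_n,l)F_{t-1}^{(l)}(\mathbf{x}_n)\big)M(y_n,l)\,h_t^{(l)}(\mathbf{x}_n) \\
&= \Big(\sum_{n=1}^{N}\exp\big(-M(y_n,l)F_{t-1}^{(l)}(\mathbf{x}_n)\big)\Big)\sum_{n=1}^{N}\frac{\exp\big(-M(y_n,l)F_{t-1}^{(l)}(\mathbf{x}_n)\big)}{\sum_{i=1}^{N}\exp\big(-M(y_i,l)F_{t-1}^{(l)}(\mathbf{x}_i)\big)}M(y_n,l)\,h_t^{(l)}(\mathbf{x}_n)
\end{aligned} \tag{5.12}$$

is maximized. Define:

$$d_t^{(l)}(n) = \frac{\exp\big(-F_{t-1}^{(l)}(\mathbf{x}_n)M(y_n,l)\big)}{\sum_{i=1}^{N}\exp\big(-F_{t-1}^{(l)}(\mathbf{x}_i)M(y_i,l)\big)} \tag{5.13}$$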


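It follows that $h_t^{(l)}$ should be selected to minimize the training error with respect to
$d_t^{(l)}$, i.e.,

$$h_t^{(l)} = \arg\min_{h \in \mathcal{H}} \epsilon_t^{(l)} = \arg\min_{h \in \mathcal{H}} \sum_{n=1}^{N} d_t^{(l)}(n)\, I\big(h(\mathbf{x}_n) \neq M(y_n,l)\big) \tag{5.14}$$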


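After $\{h_t^{(l)}\}_{l=1}^{L}$ are found, the combination coefficient $\hat{\alpha}_t$ is computed to minimize the
intermediate cost function $E_t$:

$$E_t = \sum_{l=1}^{L}\sum_{n=1}^{N}\exp\Big(-\big(F_{t-1}^{(l)}(\mathbf{x}_n) + \hat{\alpha}_t h_t^{(l)}(\mathbf{x}_n)\big)M(y_n,l)\Big) \tag{5.15}$$

Taking the derivative of $E_t$ with respect to $\hat{\alpha}_t$ and setting it to zero yields:

$$\frac{\partial E_t}{\partial \hat{\alpha}_t} = -\sum_{l=1}^{L}\sum_{n=1}^{N}\exp\Big(-\big(F_{t-1}^{(l)}(\mathbf{x}_n) + \hat{\alpha}_t h_t^{(l)}(\mathbf{x}_n)\big)M(y_n,l)\Big)h_t^{(l)}(\mathbf{x}_n)M(y_n,l) = 0 \tag{5.16}$$

With the distribution update rule defined in Equation (5.3) and following the same
derivation procedure as AdaBoost, it is easy to show:

$$\hat{\alpha}_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} \tag{5.17}$$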

Therefore, AdaBoost.MO can be viewed as a stage-wise forward gradient procedure
on the cost function of margin distribution. The only difference, compared
with AdaBoost, is that the summation in (5.10) is not only over the data index but
also over the code matrix column index. Hence, the margin theory presented in [70]
can be directly applied to AdaBoost.MO with little modification: AdaBoost.MO
tries to maximize the margin at each bit position:

$$\max\ \rho^{(l)}(\mathbf{x}_n) \tag{5.18}$$

Although it may not be the best way to define a margin as above for a multiclass
problem, it is sensible. The problem of maximizing the margin can be solved exactly
by solving the following linear programming problem:

$$\begin{aligned}
\max_{(\rho,\boldsymbol{\alpha})}\quad & \rho \\
\text{subject to}\quad & \rho^{(l)}(\mathbf{x}_n) \ge \rho, \quad n = 1, \dots, N,\ l = 1, \dots, L \\
& \sum_{t}\alpha_t = 1,\ \boldsymbol{\alpha} \ge 0
\end{aligned} \tag{5.19}$$

where $\rho^{(l)}(\mathbf{x}_n) = \sum_{t}\alpha_t h_t^{(l)}(\mathbf{x}_n)M(y_n,l)$ is the margin of $\mathbf{x}_n$ at bit position $l$.

5.2.2  Regularized  AdaBoost.MO 


In  this  subsection,  we  derive  a regularized  AdaBoost.MO  algorithm. 

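The dual form of (5.19) is:

$$\begin{aligned}
\min_{(\gamma, D)}\quad & \gamma \\
\text{subject to}\quad & \sum_{n=1}^{N}\sum_{l=1}^{L} D(n,l)\,h_j^{(l)}(\mathbf{x}_n)M(y_n,l) \le \gamma, \quad j = 1, \dots, |\mathcal{H}| \\
& D(n,l) \in \Gamma
\end{aligned} \tag{5.20}$$

Following the same arguments presented in Chapter 4, we add a penalty term to the
objective function to control the distribution skewness:

$$\begin{aligned}
\min_{(\gamma, D)}\quad & \gamma + \beta\sum_{\{n,l\}} D(n,l)\ln\big(D(n,l)LN\big) \\
\text{subject to}\quad & \sum_{n=1}^{N}\sum_{l=1}^{L} D(n,l)\,h_j^{(l)}(\mathbf{x}_n)M(y_n,l) \le \gamma, \quad j = 1, \dots, |\mathcal{H}| \\
& D(n,l) \in \Gamma
\end{aligned} \tag{5.21}$$

Following the same derivation as for AdaBoostKL, the primal problem of the linearized
approximation of (5.21) is:

$$\begin{aligned}
\max_{(\rho,\boldsymbol{\alpha})}\quad & \rho \\
\text{subject to}\quad & \rho^{(l)}(\mathbf{x}_n) = \sum_{t=1}^{T}\alpha_t h_t^{(l)}(\mathbf{x}_n)M(y_n,l) + \beta\sum_{t=1}^{T}\alpha_t\ln\big(D^{(t)}(n,l)LN\big) \ge \rho \\
& n = 1, \dots, N,\ l = 1, \dots, L \\
& \sum_{t}\alpha_t = 1,\ \boldsymbol{\alpha} \ge 0
\end{aligned} \tag{5.22}$$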


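A soft margin can be defined as:

$$\rho^{(l)}(\mathbf{x}_n) = \sum_{t=1}^{T}\alpha_t h_t^{(l)}(\mathbf{x}_n)M(y_n,l) + \beta\sum_{t=1}^{T}\alpha_t\ln\big(D^{(t)}(n,l)LN\big) \tag{5.23}$$

Based on the soft margin definition, an AdaBoost-like algorithm can be formulated
with a new cost function defined as:

$$\begin{aligned}
E &= \sum_{l=1}^{L}\sum_{n=1}^{N}\exp\Big(-\rho^{(l)}(\mathbf{x}_n)\sum_{t}\hat{\alpha}_t\Big) \\
&= \sum_{l=1}^{L}\sum_{n=1}^{N}\exp\Big(-\sum_{t}\hat{\alpha}_t h_t^{(l)}(\mathbf{x}_n)M(y_n,l) - \beta\sum_{t}\hat{\alpha}_t\ln\big(D^{(t)}(n,l)LN\big)\Big)
\end{aligned} \tag{5.24}$$

The combination coefficient $\hat{\alpha}_t$ is computed by minimizing this cost over the newest coefficient:

$$\hat{\alpha}_t = \arg\min_{\alpha_t \ge 0}\sum_{l=1}^{L}\sum_{n=1}^{N}\exp\Big(-\sum_{j=1}^{t}\alpha_j h_j^{(l)}(\mathbf{x}_n)M(y_n,l) - \beta\sum_{j=1}^{t}\alpha_j\ln\big(D^{(j)}(n,l)LN\big)\Big) \tag{5.25}$$

and the distribution update rule is:

$$D^{(t+1)}(n,l) = D^{(t)}(n,l)\exp\Big(-\hat{\alpha}_t h_t^{(l)}(\mathbf{x}_n)M(y_n,l) - \beta\hat{\alpha}_t\ln\big(D^{(t)}(n,l)LN\big)\Big)/C_t \tag{5.26}$$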
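The effect of the penalty on the update (5.26) can be seen in a small numerical sketch. The function name, the toy distribution, and the fixed coefficient below are illustrative assumptions. In the algorithm the coefficient comes from (5.25), and all weights must be strictly positive for the logarithm to be defined.

```python
import math

def regularized_update(D, margins, alpha, beta):
    """One step of (5.26) on an N x L distribution D (nested lists).
    margins[n][l] stands for h_t^(l)(x_n) * M(y_n, l), a value in {+1, -1}."""
    N, L = len(D), len(D[0])
    new = [[D[n][l] * math.exp(-alpha * margins[n][l]
                               - beta * alpha * math.log(D[n][l] * L * N))
            for l in range(L)] for n in range(N)]
    Z = sum(map(sum, new))                      # normalization constant C_t
    return [[v / Z for v in row] for row in new]

# A skewed toy distribution over N = 4 samples and L = 1 column,
# with every sample classified correctly in this round.
D0 = [[0.7], [0.1], [0.1], [0.1]]
margins = [[+1], [+1], [+1], [+1]]
plain = regularized_update(D0, margins, alpha=0.5, beta=0.0)
reg = regularized_update(D0, margins, alpha=0.5, beta=1.0)
print(max(r[0] for r in plain), max(r[0] for r in reg))
```

With beta = 0 the update reduces to (5.3) and the skewed weight is left unchanged here. With beta > 0 the extra term shrinks weights above the uniform level 1/(LN) and raises those below it, which is the skewness control the penalty is meant to provide.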

The regularized version of AdaBoost.MO is summarized as follows:


Regularized AdaBoost.MO

Initialization: $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \in \mathcal{X} \times \mathcal{Y}$, maximum iteration number $T$,
$D^{(1)}(n,l) = 1/(NL)$, parameter $\beta$.

for t = 1 : T

1. Train weak learner with respect to distribution $D^{(t)}$ and get a set of
hypotheses $\{h_t^{(l)}(\mathbf{x})\}_{l=1}^{L}$.

2. Compute the combination coefficient as (5.25).

3. Update weights as (5.26).

end

Output: $\mathbf{F}(\mathbf{x}) = \big[F^{(1)}(\mathbf{x}), \dots, F^{(L)}(\mathbf{x})\big]$


5.3  AdaBoost.ECC  and  Regularized  AdaBoost.ECC 

In this section, we discuss another multiclass AdaBoost algorithm: AdaBoost.ECC
[33]. AdaBoost.ECC was derived from AdaBoost.OC [67] on the algorithm level.
Here we present a more rigorous derivation based on the margin concept. We show that
AdaBoost.ECC can be categorized as an algorithm performing a stage-wise functional
gradient descent procedure on the cost function of margin distributions. A
new regularized AdaBoost.ECC algorithm is also proposed.

5.3.1  AdaBoost.ECC 

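We start from a new margin definition:

$$\rho(\mathbf{x}_n) = \min_{\{c \in \mathcal{C},\, c \neq y_n\}} d\big(M(c), \mathbf{f}(\mathbf{x}_n)\big) - d\big(M(y_n), \mathbf{f}(\mathbf{x}_n)\big) \tag{5.27}$$

where $d$ is a distance measure, $\mathbf{f}(\mathbf{x}_n) = [f_1(\mathbf{x}_n), \dots, f_L(\mathbf{x}_n)]$ and $M \in \{\pm 1\}^{C \times L}$ is
the code matrix. The row, or the code word, for the $c$-th class is denoted as $M(c)$.
The physical meaning of (5.27) is clear: we want to find $\mathbf{f}(\mathbf{x}_n)$ which is close to the
corresponding code word and far away from the most confused class. Note that $\mathbf{x}_n$ is
correctly classified if and only if $\rho(\mathbf{x}_n)$ is positive. At the moment, we assume the code
matrix $M$ is available or can be generated in some way. If we calculate $d$ as:

$$d\big(M(c), \mathbf{f}(\mathbf{x})\big) = \|M(c) - \mathbf{f}(\mathbf{x})\|^2 \tag{5.28}$$

then (5.27) becomes:

$$\begin{aligned}
\rho(\mathbf{x}_n) &= \min_{\{c \in \mathcal{C},\, c \neq y_n\}} \|M(c) - \mathbf{f}(\mathbf{x}_n)\|^2 - \|M(y_n) - \mathbf{f}(\mathbf{x}_n)\|^2 \\
&= 2M(y_n)\mathbf{f}^T(\mathbf{x}_n) - \max_{\{c \in \mathcal{C},\, c \neq y_n\}}\big\{2M(c)\mathbf{f}^T(\mathbf{x}_n)\big\}
\end{aligned} \tag{5.29}$$

The second line holds because every code word has the same norm, $\|M(c)\|^2 = L$.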
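The identity in (5.29) can be verified numerically. The code matrix and the vector f below are made-up values chosen only for illustration, and classes are indexed from 0.

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def margin_by_distance(M, f, y):
    """First line of (5.29): difference of squared distances."""
    return min(sq_dist(M[c], f) for c in range(len(M)) if c != y) - sq_dist(M[y], f)

def margin_by_correlation(M, f, y):
    """Second line of (5.29): difference of inner products, times 2."""
    return 2 * dot(M[y], f) - max(2 * dot(M[c], f) for c in range(len(M)) if c != y)

M = [[+1, -1, -1, +1],      # hypothetical code matrix, C = 3, L = 4
     [-1, +1, -1, +1],
     [-1, -1, +1, -1]]
f = [0.3, -0.2, -0.4, 0.1]  # made-up ensemble output for one sample

print(margin_by_distance(M, f, 0), margin_by_correlation(M, f, 0))
```

Both forms give the same positive margin for class 0 in this example, so the sample would be classified correctly.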

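In the following discussions, we drop the constant factor for notational simplicity.
Our goal is to maximize the margin of the classifier, which can be formulated as an
optimization problem:

$$\begin{aligned}
\max\quad & \rho \\
\text{subject to}\quad & M(y_n)\mathbf{f}^T(\mathbf{x}_n) - \max_{\{c \in \mathcal{C},\, c \neq y_n\}}\big\{M(c)\mathbf{f}^T(\mathbf{x}_n)\big\} \ge \rho \\
& n = 1, \dots, N
\end{aligned} \tag{5.30}$$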

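It is equivalent to:

$$\begin{aligned}
\max\quad & \rho \\
\text{subject to}\quad & M(y_n)\mathbf{f}^T(\mathbf{x}_n) - M(c)\mathbf{f}^T(\mathbf{x}_n) \ge \rho \\
& n = 1, \dots, N,\ c = 1, \dots, C,\ c \neq y_n
\end{aligned} \tag{5.31}$$

We are now particularly interested in finding $\mathbf{f}(\mathbf{x})$ using the AdaBoost algorithm:

$$\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \dots, f_T(\mathbf{x})] = [\alpha_1 h_1(\mathbf{x}), \dots, \alpha_T h_T(\mathbf{x})] \tag{5.32}$$

where $\sum_{t=1}^{T}\alpha_t = 1$, $\boldsymbol{\alpha} \ge 0$. It is clear that in this case the length of a code word is
equal to the maximum iteration $T$. The above optimization can then be written as:

$$\begin{aligned}
\max\quad & \rho \\
\text{subject to}\quad & \sum_{t=1}^{T}\alpha_t\big(M(y_n,t) - M(c,t)\big)h_t(\mathbf{x}_n) \ge \rho \\
& n = 1, \dots, N,\ c = 1, \dots, C,\ c \neq y_n \\
& \sum_{t=1}^{T}\alpha_t = 1,\ \boldsymbol{\alpha} \ge 0
\end{aligned} \tag{5.33}$$

In light of the connection between LP and AdaBoost, it is reasonable to define a new
cost function which AdaBoost tries to optimize:

$$E = \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\sum_{t=1}^{T}\hat{\alpha}_t\big(M(y_n,t) - M(c,t)\big)h_t(\mathbf{x}_n)\Big) \tag{5.34}$$

Indeed, the following theorem shows that AdaBoost.ECC performs a functional
gradient descent procedure on the above cost function.

Theorem 5.3.1 AdaBoost.ECC implements a stage-wise functional gradient descent
procedure on the cost function:

$$E = \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\sum_{t=1}^{T}\hat{\alpha}_t\big(M(y_n,t) - M(c,t)\big)h_t(\mathbf{x}_n)\Big) \tag{5.35}$$

Proof:

For the time being, we assume that the code matrix $M \in \{\pm 1\}^{C \times T}$ is available or
can be generated in some way. In fact, the code matrix $M$ of AdaBoost.ECC is
generated in a column-by-column fashion in the learning process. We will discuss
this issue later.

Rewrite (5.35) as:

$$E = \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\big(M(y_n) - M(c)\big)\mathbf{F}^T(\mathbf{x}_n)\Big) \tag{5.36}$$

where $\mathbf{F}(\mathbf{x}_n) = [\hat{\alpha}_1 h_1(\mathbf{x}_n), \dots, \hat{\alpha}_T h_T(\mathbf{x}_n)]_{1 \times T}$.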

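In the $(t+1)$-th iteration, $\mathbf{F}_t(\mathbf{x}_n) = [\hat{\alpha}_1 h_1(\mathbf{x}_n), \dots, \hat{\alpha}_t h_t(\mathbf{x}_n), 0, \dots, 0]_{1 \times T}$. Define
the negative functional derivative of $E$ with respect to $\mathbf{F}_t(\mathbf{x})$ as:

$$-\nabla E\big(\mathbf{F}_t(\mathbf{x})\big) = \begin{cases} 0 & \text{if } \mathbf{x} \neq \mathbf{x}_n \\ \displaystyle\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\big(M(y_n) - M(c)\big)\mathbf{F}_t^T(\mathbf{x}_n)\Big)\big(M(y_n) - M(c)\big)^T & \text{if } \mathbf{x} = \mathbf{x}_n,\ n = 1, \dots, N \end{cases} \tag{5.37}$$

We want to find $\mathbf{F}_{t+1}(\mathbf{x}_n) = [\hat{\alpha}_1 h_1(\mathbf{x}_n), \dots, \hat{\alpha}_t h_t(\mathbf{x}_n), \hat{\alpha}_{t+1}h_{t+1}(\mathbf{x}_n), 0, \dots, 0]_{1 \times T}$. That
is, only the $(t+1)$-th entry of $\mathbf{F}(\mathbf{x}_n)$ is updated. We select $h_{t+1}$ as the one most
correlated with the $(t+1)$-th entry of $-\nabla E(\mathbf{F}_t(\mathbf{x}))$, i.e.,

$$h_{t+1} = \arg\max_{h \in \mathcal{H}}\sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\big(M(y_n) - M(c)\big)\mathbf{F}_t^T(\mathbf{x}_n)\Big)\big(M(y_n,t+1) - M(c,t+1)\big)h(\mathbf{x}_n) \tag{5.38}$$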


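Define:

$$U_{t+1} = \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\big(M(y_n) - M(c)\big)\mathbf{F}_t^T(\mathbf{x}_n)\Big)\big|M(y_n,t+1) - M(c,t+1)\big| \tag{5.39}$$

and

$$d^{(t+1)}(n) = \frac{\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\big(M(y_n) - M(c)\big)\mathbf{F}_t^T(\mathbf{x}_n)\Big)\big|M(y_n,t+1) - M(c,t+1)\big|}{U_{t+1}} \tag{5.40}$$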

It immediately follows that

$$h_{t+1} = \arg\max_{h \in \mathcal{H}} U_{t+1}\sum_{n=1}^{N} d^{(t+1)}(n)\,\mathrm{sign}\big(M(y_n,t+1) - M(c,t+1)\big)h(\mathbf{x}_n) \tag{5.41}$$

Note that $(M(y_n,t+1) - M(c,t+1))$ either is equal to zero or has the same sign as
$M(y_n,t+1)$. Then:

$$h_{t+1} = \arg\max_{h \in \mathcal{H}}\sum_{n=1}^{N} d^{(t+1)}(n)M(y_n,t+1)h(\mathbf{x}_n) = \arg\min_{h \in \mathcal{H}}\epsilon_{t+1} \tag{5.42}$$

where $\epsilon_{t+1}$ is the weighted error with respect to $d^{(t+1)}$:

$$\epsilon_{t+1} = \sum_{n=1}^{N} d^{(t+1)}(n)\, I\big(M(y_n,t+1) \neq h(\mathbf{x}_n)\big) \tag{5.43}$$

After $h_{t+1}$ is found, $\hat{\alpha}_{t+1}$ can be computed by a line search:


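$$\begin{aligned}
\hat{\alpha}_{t+1} &= \arg\min_{\alpha \ge 0} E_{t+1} \\
&= \arg\min_{\alpha \ge 0}\sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\sum_{j=1}^{t}\hat{\alpha}_j\big(M(y_n,j) - M(c,j)\big)h_j(\mathbf{x}_n) - \alpha\big(M(y_n,t+1) - M(c,t+1)\big)h_{t+1}(\mathbf{x}_n)\Big)
\end{aligned} \tag{5.44}$$

If $h$ is a binary classifier, $\hat{\alpha}_{t+1}$ can be solved analytically. Taking the derivative of $E_{t+1}$
with respect to $\hat{\alpha}_{t+1}$ and setting it to zero gives:

$$\frac{\partial E_{t+1}}{\partial \hat{\alpha}_{t+1}} = -\sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\sum_{j=1}^{t+1}\hat{\alpha}_j\big(M(y_n,j) - M(c,j)\big)h_j(\mathbf{x}_n)\Big)\big(M(y_n,t+1) - M(c,t+1)\big)h_{t+1}(\mathbf{x}_n) = 0 \tag{5.45}$$

It is a bit tricky to solve (5.45) for $\hat{\alpha}_{t+1}$. Note that $(M(y_n,t+1) - M(c,t+1))$ takes
values from $\{0, 2, -2\}$. If $(M(y_n,t+1) - M(c,t+1)) = 0$, it contributes nothing
to (5.45). We also know that a nonzero $(M(y_n,t+1) - M(c,t+1))$ has the same sign as
$M(y_n,t+1)$. Therefore, (5.45) can be rewritten as:

$$\begin{aligned}
\frac{\partial E_{t+1}}{\partial \hat{\alpha}_{t+1}} &= -\sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\Big(-\sum_{j=1}^{t}\hat{\alpha}_j\big(M(y_n,j) - M(c,j)\big)h_j(\mathbf{x}_n)\Big)\big|M(y_n,t+1) - M(c,t+1)\big| \\
&\qquad\qquad \times \exp\big(-2\hat{\alpha}_{t+1}M(y_n,t+1)h_{t+1}(\mathbf{x}_n)\big)M(y_n,t+1)h_{t+1}(\mathbf{x}_n) \\
&= -U_{t+1}\sum_{n=1}^{N} d^{(t+1)}(n)\exp\big(-2\hat{\alpha}_{t+1}M(y_n,t+1)h_{t+1}(\mathbf{x}_n)\big)M(y_n,t+1)h_{t+1}(\mathbf{x}_n) \\
&= -U_{t+1}\Big(\sum_{\{n:\, M(y_n,t+1) = h_{t+1}(\mathbf{x}_n)\}} d^{(t+1)}(n)\exp(-2\hat{\alpha}_{t+1}) - \sum_{\{n:\, M(y_n,t+1) \neq h_{t+1}(\mathbf{x}_n)\}} d^{(t+1)}(n)\exp(2\hat{\alpha}_{t+1})\Big) \\
&= 0
\end{aligned} \tag{5.46}$$

Therefore,

$$\hat{\alpha}_{t+1} = \frac{1}{4}\ln\frac{1 - \epsilon_{t+1}}{\epsilon_{t+1}} \tag{5.47}$$

After $h_{t+1}$ and $\hat{\alpha}_{t+1}$ are found, the distribution is updated as: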

$$D^{(t+2)}(n,c) = D^{(t+1)}(n,c)\exp\Big(-\hat{\alpha}_{t+1}\big(M(y_n,t+1) - M(c,t+1)\big)h_{t+1}(\mathbf{x}_n)\Big)/C_t \tag{5.48}$$

where $C_t$ is a normalization term. Comparing these steps with those of AdaBoost.ECC, which
we summarize below, we conclude that AdaBoost.ECC implements a stage-wise
functional gradient descent procedure on the cost function (5.35).
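The closed form (5.47) can be checked against a brute-force line search. The sketch below keeps only the part of the cost in (5.44) that depends on the new coefficient, up to a positive constant factor. The weights and agreement pattern are made-up values.

```python
import math

def line_search_cost(alpha, d, agree):
    """alpha-dependent part of E_{t+1}: sum_n d(n) * exp(-2 * alpha * agree[n]),
    where agree[n] = M(y_n, t+1) * h_{t+1}(x_n) is +1 (correct) or -1 (wrong)."""
    return sum(w * math.exp(-2 * alpha * a) for w, a in zip(d, agree))

d = [0.4, 0.3, 0.2, 0.1]          # hypothetical distribution d^(t+1)(n)
agree = [+1, +1, -1, +1]          # the third sample is misclassified

eps = sum(w for w, a in zip(d, agree) if a == -1)      # weighted error (5.43)
alpha_closed = 0.25 * math.log((1 - eps) / eps)        # closed form (5.47)

grid = [i / 1000 for i in range(1, 2000)]              # coarse grid over (0, 2)
alpha_grid = min(grid, key=lambda a: line_search_cost(a, d, agree))

print(alpha_closed, alpha_grid)
```

The factor 1/4, rather than the 1/2 of (5.8), comes from the nonzero code word differences having magnitude 2.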


AdaBoost.ECC [33] is summarized as follows:


AdaBoost.ECC

Initialization: $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \in \mathcal{X} \times \mathcal{Y}$, maximum iteration number $T$,
$D^{(1)}(n,c) = 1/(N(C-1))$ if $c \neq y_n$ and 0 otherwise.

for t = 1 : T

1. Define the $t$-th column of $M$.

2. Define $U_t$ as (5.39).

3. Calculate $d^{(t)}$ as (5.40).

4. Train weak learner with respect to distribution $d^{(t)}$ and get a hypothesis
$h_t(\mathbf{x})$.

5. Calculate the weighted error $\epsilon_t$ as (5.43).

6. Compute the combination coefficient $\hat{\alpha}_t$ as (5.47).

7. Update weights as (5.48).

end

Output: $\mathbf{F}(\mathbf{x}) = [\hat{\alpha}_1 h_1(\mathbf{x}), \dots, \hat{\alpha}_T h_T(\mathbf{x})]$

Now we discuss how to generate the code matrix $M$. Considering (5.38),
with the $(t+1)$-th column of $M$, $M_{\cdot,t+1}$, unspecified, our goal becomes finding $M_{\cdot,t+1}$
and $h_{t+1}$ simultaneously to maximize the correlation with the $(t+1)$-th entry of
$-\nabla E(\mathbf{F}_t(\mathbf{x}))$. Unfortunately, this problem is proven to be NP-hard. We mitigate
the problem by doing the optimization separately in two parts: we
first find $M_{\cdot,t+1}$ to maximize $U_{t+1}$ and then we train $h_{t+1}$ based on (5.38). The first
problem is the so-called "Max-Cut" problem. The Max-Cut problem is NP-complete.
However, for commonly encountered problem sizes, say $C = 10$, an exhaustive search
is tractable. The second problem is simply a classification problem under the
partition defined by the Max-Cut solution.
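For a small number of classes, the exhaustive search over candidate columns can be sketched as below. The weights D[n][c] stand in for the exponential terms in (5.39), to which the distribution is proportional. The function name, the toy weights, and the 0-based class labels are assumptions made for illustration.

```python
from itertools import product

def best_column(D, ys, C):
    """Exhaustively pick mu in {+1, -1}^C maximizing
    U = sum_n sum_{c != y_n} D[n][c] * |mu[y_n] - mu[c]|."""
    best_U, best_mu = -1.0, None
    for mu in product((+1, -1), repeat=C):
        if mu[0] == -1:           # mu and -mu give the same U, so fix the first sign
            continue
        U = sum(D[n][c] * abs(mu[ys[n]] - mu[c])
                for n in range(len(ys)) for c in range(C) if c != ys[n])
        if U > best_U:
            best_U, best_mu = U, mu
    return best_mu, best_U

# Toy weights for N = 3 samples (one per class) and C = 3 classes.
ys = [0, 1, 2]
D = [[0.0, 0.4, 0.05],
     [0.3, 0.0, 0.05],
     [0.05, 0.15, 0.0]]
mu, U = best_column(D, ys, C=3)
print(mu, U)
```

The search visits 2^(C-1) candidates, which is 512 for C = 10. The chosen column puts classes 0 and 1 on opposite sides because most of the weight sits on confusions between them.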

5.3.2  Regularized  AdaBoost.ECC 


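Now we start deriving a regularized AdaBoost.ECC algorithm. For simplicity,
we temporarily assume that the cardinality of $\mathcal{H}$ is finite. The dual form of (5.33) is:

$$\begin{aligned}
\min_{(\gamma, D)}\quad & \gamma \\
\text{subject to}\quad & \sum_{n=1}^{N}\sum_{c=1}^{C} D(n,c)\big(M(y_n,j) - M(c,j)\big)h_j(\mathbf{x}_n) \le \gamma, \quad j = 1, \dots, |\mathcal{H}| \\
& \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C} D(n,c) = 1,\ D(n,c) \ge 0
\end{aligned} \tag{5.49}$$

We control the distribution skewness by adding a penalty term to the objective
function in the above optimization problem. The penalty function measures the
distance between the query distribution and the center distribution. Two particular
choices are the K-L distance and the $l_2$ distance. Here we only consider the K-L distance.
However, all of the derivations can be readily extended to the $l_2$ case.

$$\begin{aligned}
\min_{(\gamma, D)}\quad & \gamma + \beta\sum_{n=1}^{N} d_n\ln(d_n N) \\
\text{subject to}\quad & \sum_{n=1}^{N}\sum_{c=1}^{C} D(n,c)\big(M(y_n,j) - M(c,j)\big)h_j(\mathbf{x}_n) \le \gamma, \quad j = 1, \dots, |\mathcal{H}| \\
& \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C} D(n,c) = 1,\ D(n,c) \ge 0
\end{aligned} \tag{5.50}$$

where $d_n = \sum_{\{c=1,\, c \neq y_n\}}^{C} D(n,c)$. The above optimization problem can be reformulated
as:

$$\begin{aligned}
\min_{(\gamma, D)}\quad & \gamma \\
\text{s.t.}\quad & s_j(\mathbf{D}) = \sum_{n=1}^{N}\sum_{c=1}^{C} D(n,c)\big(M(y_n,j) - M(c,j)\big)h_j(\mathbf{x}_n) + \beta\sum_{n=1}^{N} d_n\ln(d_n N) \le \gamma, \quad j = 1, \dots, |\mathcal{H}| \\
& \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C} D(n,c) = 1,\ D(n,c) \ge 0
\end{aligned} \tag{5.51}$$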

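Define $s(\mathbf{D}) = \max_{1 \le j \le |\mathcal{H}|} s_j(\mathbf{D})$. Note that $s(\mathbf{D})$ is a non-smooth convex function in
$\mathbf{D}$ since each $s_j(\mathbf{D})$ is a convex function. Suppose now we have a set of query
distributions $S = \{\mathbf{D}^{(t)}\}_{t=1}^{T}$. For each query distribution $\mathbf{D}^{(t)}$ we can find one supporting
hyperplane given by:

$$\gamma = s\big(\mathbf{D}^{(t)}\big) + \partial_t^T\big(\mathbf{D} - \mathbf{D}^{(t)}\big) \tag{5.52}$$

where $\partial_t$ is one element of the subdifferential $\partial s(\mathbf{D}^{(t)})$ of $s$ evaluated at $\mathbf{D}^{(t)}$. Here we
abuse the notation a bit by using $\mathbf{D}$ to denote both a matrix and the vector generated from it
by stacking all rows in a specific order. We also use $M(n,c,j) = M(y_n,j) - M(c,j)$
to simplify the notation. Due to the convexity of $s$, a supporting hyperplane gives an
underestimate of $s$. More precisely, each term of Equation (5.52) can be written as:

$$\begin{aligned}
s\big(\mathbf{D}^{(t)}\big) &= \max_{j} s_j\big(\mathbf{D}^{(t)}\big) \\
&= \sum_{c}\sum_{n} D^{(t)}(n,c)M(n,c,t)h_t(\mathbf{x}_n) + \beta\sum_{n} d_n^{(t)}\ln\big(d_n^{(t)}N\big)
\end{aligned} \tag{5.53}$$

where

$$h_t = \arg\max_{h_j \in \mathcal{H}}\sum_{c}\sum_{n} D^{(t)}(n,c)M(n,c,j)h_j(\mathbf{x}_n) \tag{5.54}$$

and $\partial_t$ is given as:

$$\frac{\partial s}{\partial D(n,c)}\bigg|_{D(n,c) = D^{(t)}(n,c)} = M(n,c,t)h_t(\mathbf{x}_n) + \beta\big(\ln(d_n^{(t)}N) + 1\big) \tag{5.55}$$

Plugging (5.53) and (5.55) into (5.52) yields:

$$\gamma = \sum_{c}\sum_{n} D(n,c)M(n,c,t)h_t(\mathbf{x}_n) + \beta\sum_{n} d_n\ln\big(d_n^{(t)}N\big) \tag{5.56}$$

The original convex optimization problem can be approximated as a linear programming
problem: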


$$\begin{aligned}
\min_{(\gamma, D)}\quad & \gamma \\
\text{subject to}\quad & \sum_{n}\sum_{c} D(n,c)M(n,c,t)h_t(\mathbf{x}_n) + \beta\sum_{n} d_n\ln\big(d_n^{(t)}N\big) \le \gamma \quad \text{for every } \mathbf{D}^{(t)} \in S \\
& \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C} D(n,c) = 1,\ D(n,c) \ge 0
\end{aligned} \tag{5.57}$$

Like LPnorm2-AdaBoost, the query distributions can be obtained through the column
generation technique. However, since the code matrix is unknown to us in advance,
we need to generate $M$ in a column-by-column fashion. In the $t$-th iteration, we want
to find $M_{\cdot,t}$ and $h_t$ so that the inequality constraint is violated. It is equivalent to
maximizing the following term:

$$U = \sum_{c}\sum_{n} D^{(t)}(n,c)M(n,c,t)h_t(\mathbf{x}_n) \tag{5.58}$$

with respect to the $t$-th column of $M$ and $h_t$, which is the same problem we face in
AdaBoost.ECC. Therefore, we can directly apply the solution presented in
the last subsection without any modification.
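The linearization of the K-L penalty that leads to (5.56) can be checked numerically. For distributions that sum to one, the penalty in (5.51) exceeds its linearized form by exactly the K-L divergence between the two distributions, so each supporting hyperplane underestimates $s$. The distributions below are made-up values.

```python
import math

def kl_penalty(d):
    """Penalty term of (5.51) without beta: sum_n d_n * ln(d_n * N)."""
    N = len(d)
    return sum(p * math.log(p * N) for p in d)

def linearized_penalty(d, d_t):
    """Penalty term of (5.56) without beta: sum_n d_n * ln(d_n^(t) * N)."""
    N = len(d)
    return sum(p * math.log(q * N) for p, q in zip(d, d_t))

d_t = [0.5, 0.3, 0.2]     # hypothetical query distribution d^(t)
d = [0.2, 0.2, 0.6]       # another point on the simplex

gap = kl_penalty(d) - linearized_penalty(d, d_t)          # equals KL(d || d_t)
tight = kl_penalty(d_t) - linearized_penalty(d_t, d_t)    # zero at the query point
print(gap, tight)
```

The gap is positive away from the query distribution and vanishes at it, which is why adding more query distributions tightens the LP approximation (5.57).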

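The regularized version of AdaBoost.ECC is summarized as follows:


Regularized AdaBoost.ECC

Initialization: $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \in \mathcal{X} \times \mathcal{Y}$, maximum iteration number $T$,
$D^{(1)}(n,c) = 1/(N(C-1))$ if $c \neq y_n$ and 0 otherwise, parameter $\beta$.

for t = 1 : T

1. Define the $t$-th column of $M$.

2. Train base learner to maximize $U$ in (5.58) and get hypothesis
$h_t(\mathbf{x}) : \mathcal{X} \to \{\pm 1\}$.

3. Solve the optimization problem:

$$\begin{aligned}
(\mathbf{D}^*, \gamma^*) = \arg\min_{(\gamma, D)}\quad & \gamma \\
\text{subject to}\quad & \sum_{n}\sum_{c} D(n,c)M(n,c,j)h_j(\mathbf{x}_n) + \beta\sum_{n} d_n\ln\big(d_n^{(j)}N\big) \le \gamma, \quad j = 1, \dots, t \\
& \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C} D(n,c) = 1,\ D(n,c) \ge 0
\end{aligned} \tag{5.59}$$

4. Update weights as $\mathbf{D}^{(t+1)} = \mathbf{D}^*$.

end

Output: $\mathbf{F}(\mathbf{x}) = [\alpha_1 h_1(\mathbf{x}), \dots, \alpha_T h_T(\mathbf{x})]$, where $\boldsymbol{\alpha}$ is the vector of Lagrangian
multipliers from the last LP.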


5.4  Summary 

In this chapter, we have investigated multiclass AdaBoost algorithms.
In particular, AdaBoost.MO and AdaBoost.ECC are considered. We have shown that
both algorithms can be viewed as performing a functional gradient descent procedure on
certain cost functions of margin distributions. We have also proposed two regularized
multiclass AdaBoost algorithms.


CHAPTER  6 

APPLICATION  TO  LANDMINE  DETECTION 

6.1  Introduction 

Landmines are causing enormous humanitarian and economic problems in
many countries all over the world. Experts estimate that up to 110 million landmines
need to be cleared and more than 20,000 civilians are killed or maimed every
year by landmines, with many of the victims being children [2]. However, landmine
detection and clearance have turned out to be an extremely challenging problem. At
the current clearance rate, it will take about 1,000 years to remove all landmines
that are already placed, and for every landmine cleared, a further 20 are being buried.
Therefore it is urgent to develop a safe and cost-efficient landmine detection system.
In the past fifteen years, various techniques, including acoustic sensors, infrared
techniques, quadrupole resonance, and down- and forward-looking ground penetrating
radar (FLGPR), have been investigated. Among them, FLGPR has several advantages
over others, including long standoff distances and fast interrogation of a large
area, and thus is considered a viable technology for landmine detection [41], [72].

In this chapter, we consider landmine detection using forward-looking ground
penetrating radar. As shown in Figure 6.1, this system has GPR antennas mounted
on the front of a vehicle and captures radar signals as the vehicle moves forward.
Detailed descriptions of the system can be found in [41]. With the use of the
receiver antenna array, a high resolution radar image can be formed from the received
signals. Our task is to detect the presence of landmines in the radar images.
It can in general be formulated as an object recognition problem. Conventional
signal detection techniques such as the matched filter method may not be applicable
here since it is very difficult, if not impossible, to estimate the target signatures as
well as the clutter statistical properties due to the extremely complicated operating
environments. One possible method to overcome this problem is to design a classifier
through learning, which is often used in the problem of detecting faces or pedestrians
in highly cluttered scenes [54], [55].

However, compared to the conventional object recognition problem, landmine
detection has its own specific challenges. First, in contrast to face recognition
and character recognition where intensity images provide
us with an abundant amount of human recognizable information, radar images can
only be fully understood through the analysis of radar scattering phenomena. Based
on the object size, physical structures and composition materials, different objects
react to incident radar signals differently [13], [12], [42]. Hence, how to quantify these
differences and thereby design a classifier becomes the key to the success of landmine
detection. In the context of pattern classification, this process is referred to as feature
extraction. One may bypass this stage by using some special classifiers (for example,
neural networks) which have a feature extractor implicitly embedded in the classifier
structure. In our case of a limited number of training samples (in particular mine
samples), however, this scheme may not be applicable. Though numerical simulations
may help us identify this useful information, given the extremely complicated
surveying scenarios, the usefulness of simulated information is very limited and we
must resort to a training based method which is capable of automatically extracting
the intricate structures of the target signals through learning.

The second challenge in landmine detection is that, in contrast to the case of
pattern classification where we only need to decide between well-defined classes, a
landmine detector is required to differentiate mines from the rest of the world. While
the mine samples are well-defined, there are no typical examples for clutter. In other
words, clutter can be anything other than mines. It requires us to design a classifier
which has sufficient expression power to claim a region in the high dimensional feature
space for mines. Note that this requirement is in conflict with the previous process of
feature extraction in some ways. On the one hand, we want to reduce the data
dimensionality to improve the classifier generalization capability over unseen samples;
yet at the same time, the classifier should have sufficient expression power to attain a
low training error.

A conventional  classifier  design  procedure  usually  consists  of  two  modules: 
feature  extraction  and  classifier  training  (Figure  6.2).  This  chapter  covers  every 
aspect  of  designing  a classifier.  To  start  the  work,  the  first  important  thing  we 
need  to  do  is  to  find  out  what  kinds  of  features  we  can  extract.  Toward  this  end, 
we have introduced time-frequency analysis into landmine detection [72].
Through time-frequency analysis, we can obtain a graphical understanding of how
different objects react to the incident radar signals differently. We also find that the
most discriminant information between the two classes is time-frequency localized.
This  observation  motivates  us  to  propose  a wavelet  packet  transform  (WPT)  based 
detector.  Wavelet  packet  transform,  with  its  flexible  decomposition  capability,  is  used 
to  encode  the  time-frequency  localized  information  efficiently  into  several  bases  and 
then  a feature  selection  method  is  used  to  find  these  components  by  optimizing  a 
certain  cost  function.  With  these  selected  features,  a neural  network  classifier  with 
a simple  structure  is  designed. 

However, the conventional classifier design procedure cannot solve the aforementioned
issues of detecting targets in unconstrained environments. To our knowledge,
this problem is not fully addressed in the literature and even less so in the radar
community. For example, in feature selection, the features so produced are aimed at
optimizing a certain feature selection criterion with a predefined number of features
based on the entire training data set and thus do not favor any particular
training samples. For this reason, we refer to these features as the global features.
To deal with the unconstrained clutter samples, we need to utilize
the local information in the data space, but not at the cost of decreasing the
classifier generalization capability. In this chapter, we show that this problem can
be alleviated by using the Boosting method. In particular, AdaBoost with soft
decisions is used. The main idea of the Boosting method is to train an ensemble of
classifiers sequentially, with the subsequent classifiers focusing on the errors made by
the previous ones. Hence the Boosting method provides us with a way to identify
the hard examples in separating mines from clutter. With these examples, a new set
of features which provide the specific discriminant information for the misclassified
samples can be extracted adaptively and a new classifier can be trained. The final
decision is calculated as the weighted combination of the decisions of the member
classifiers. The experimental results based on the measured data show that with this
classification scheme, significant improvements in both the training and the testing
performances can be achieved while at the same time no apparent overfitting is
observed.

Figure 6.1: (a) An illustration of the SRI FLGPR system, (b) A photograph of the
SRI FLGPR system, (c) A typical radar image for mines. The term "M-3" denotes
a metal mine buried at the depth of 3 inches.

Figure 6.2: A conventional pattern classifier design procedure.

The remainder of the chapter is organized as follows. In Section 6.2, we give a
brief description of the SRI (Stanford Research Institute) FLGPR system and present
the time-frequency analysis for mine and clutter signals. Based on these discussions,
in Section 6.5, we propose a wavelet packet transform based landmine detector. We
give a detailed discussion on how WPT can be used to extract localized information
and how the Boosting method can be used to improve the detection performance
significantly. The effectiveness of this landmine detector is demonstrated in Section
6.6 based on the SRI measured data. Finally, we conclude our work in Section 6.7.

6.2  System  Descriptions  and  Problem  Formulation 

A photograph  of  the  SRI  FLGPR  system  is  shown  in  Figure  6.1(b).  This 
system  consists  of  2 transmitter  and  18  receiver  quad-ridged  horn  antennas.  The 
height of the transmitter antennas (two large horns) is about 3.3 m above the ground
and their geometric centers are 3.03 m apart. The 18 receiver antennas
are  horizontally  equally  spaced  with  17  cm  center  to  center  and  the  height  of  the 
bottom  row  is  about  2 m above  the  ground.  The  ultra  wideband  stepped  frequency 
system  operates  at  1024  discrete  frequencies  evenly  spaced  over  the  frequency  range 
from  442.5  to  3000  MHz  with  a step  size  of  2.5  MHz  starting  from  the  lowest  frequency. 
The  two  transmitter  antennas  work  sequentially  and  all  the  receiver  antennas  work 
simultaneously.  Hence  there  is  a total  of  36  channels  of  the  received  signal  for  each 
scan  or  vehicle  location.  Data  are  recorded  while  the  vehicle  moves  forward  and  the 
distance  between  two  adjacent  scans  is  about  2 m.  The  GPS  (Global  Positioning 
System)  is  used  to  measure  the  location  of  the  system  for  each  scan.  With  the 
use  of  the  delay-and-sum  imaging  algorithm,  a high  resolution  radar  image  can  be 
formed  from  the  received  signals.  At  each  scan  location,  the  image  region  is  6 m 
(cross-range)  by  30  m (down-range)  with  a 7 m standoff  distance  ahead  of  the  vehicle 
(Figure 6.1(a)). A pixel spacing of 4 cm is used in both the down-range and cross-range
dimensions. In the experiment, we only use the images ranging from 10 m to
20 m ahead of the vehicle. The system is optimized for this range.

Figure 6.3: The down-range and cross-range profiles of a metal mine. The down-range
profile contains much more information than the cross-range profile.
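The system parameters quoted in this section can be cross-checked with a few lines of arithmetic. The variable names below are illustrative, and the pixel counts assume that the full 6 m by 30 m image region is sampled at the stated 4 cm spacing.

```python
N_FREQ, F_START_MHZ, STEP_MHZ = 1024, 442.5, 2.5

# Stepped-frequency grid: 1024 tones starting at 442.5 MHz in 2.5 MHz steps.
freqs_mhz = [F_START_MHZ + k * STEP_MHZ for k in range(N_FREQ)]

channels = 2 * 18                    # 2 transmitters x 18 receivers per scan
pixels_cross = round(6 / 0.04)       # 6 m cross-range at 4 cm spacing
pixels_down = round(30 / 0.04)       # 30 m down-range at 4 cm spacing

print(freqs_mhz[0], freqs_mhz[-1], channels, pixels_cross, pixels_down)
```

The last tone lands on 3000 MHz, which agrees with the stated frequency range, and the channel count matches the 36 channels per scan.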

In Figures 6.6 and 6.7, we present several radar image chips for clutter and
two types of buried metal mines, denoted by M1 and M2, respectively, which are used
in the experiment. Each image chip has 32 × 32 pixels. One may start the classifier
design with the 2D image chips directly. In our case of a limited number of training
samples (in particular mine samples, cf. Section 6.6), however, it will easily lead to
overfitting.  Note  that  most  of  the  target  information  is  embedded  along  the  down- 
range  direction  due  to  a much  higher  resolution  in  the  down-range  direction  than  that 
of  the  cross-range  direction  (Figure  6.3),  which  is  determined  by  the  aperture  size 
and  wavelength.  Therefore,  our  strategy  is  to  use  the  appearance  based  Fisherface 
method  [46]  as  a prescreener  to  exploit  the  global  information  and  then  further  check 
the  down-range  profile  through  the  center  of  each  image  chip  for  the  final  decision, 
the  latter  of  which  is  the  main  focus  of  this  chapter.  This  strategy  is  also  adopted  in 
[16]. 

6.3  Time-Frequency  Analysis 

The Fourier transform uses sinusoidal functions as the basis functions and
spreads the transient components of a signal over the entire basis. As a result, the
Fourier transform is not suitable for the analysis of time-varying signals. One of the
simplest ways to overcome this drawback is the short-time Fourier transform (STFT),
which is given as follows:

$$\mathrm{STFT}_s(t, f) = \int s(\tau)\,\gamma^*(\tau - t)\,e^{-j2\pi f\tau}\,d\tau \tag{6.1}$$

where $s(t)$ is the signal of interest, $\gamma(t)$ is an analysis window function and the
superscript $*$ denotes the complex conjugate. Note that the resulting time-frequency image
is significantly influenced by the choice of the window function. More specifically, a
short window leads to good time resolution but poor frequency resolution, and vice
versa. Therefore, in the case where both high time and frequency resolutions are
required, the STFT is found to be inadequate [59].
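
The window trade-off can be checked numerically. The following is a minimal sketch rather than the implementation used in this work: it evaluates a discrete counterpart of Equation (6.1) with a direct DFT in pure Python. The names `hann`, `stft` and `frames_above`, the Hann window, and the toy signal (a sinusoid plus an impulse) are all assumptions made for illustration.

```python
import cmath
import math

def hann(width):
    """Symmetric Hann window (an assumed choice; any analysis window would do)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * m / (width - 1)) for m in range(width)]

def stft(signal, window):
    """Discrete STFT: frames[n][k] is the DFT of the windowed segment around sample n.

    The phase is referenced to the start of each segment, so only the magnitude
    is directly comparable with Equation (6.1).
    """
    n_samples, width = len(signal), len(window)
    half = width // 2
    frames = []
    for center in range(n_samples):
        # Windowed segment; samples outside the signal are treated as zero.
        segment = []
        for m in range(width):
            idx = center - half + m
            value = signal[idx] if 0 <= idx < n_samples else 0.0
            segment.append(value * window[m])
        # Direct O(width^2) DFT, adequate for a toy example.
        frames.append([
            sum(segment[m] * cmath.exp(-2j * math.pi * k * m / width) for m in range(width))
            for k in range(width)
        ])
    return frames

def frames_above(frames, bin_index, threshold):
    """Count the frames whose magnitude in one frequency bin exceeds a threshold."""
    return sum(1 for frame in frames if abs(frame[bin_index]) > threshold)

# Toy signal: a sinusoid at 0.25 cycles/sample plus an impulse at n = 32.
N = 64
s = [math.cos(2 * math.pi * 0.25 * n) for n in range(N)]
s[32] += 5.0

short_frames = stft(s, hann(8))    # good time resolution, coarse frequency grid
long_frames = stft(s, hann(32))    # finer frequency grid, poor time resolution

# The impulse reaches the DC bin, where the sinusoid contributes little, so
# counting strong DC frames measures how far the impulse is smeared in time.
short_spread = frames_above(short_frames, 0, 2.0)
long_spread = frames_above(long_frames, 0, 2.0)
```

With this toy signal the impulse is smeared over noticeably more frames under the 32-sample window than under the 8-sample window, while the 32-sample window places the sinusoid on a four times finer frequency grid.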

6.3.1  Wigner-Ville  Distribution 


The STFT computes correlations between the signal and a family of basis
functions. Thus the time-frequency resolution is governed by the corresponding
set of elementary functions. In addition to the correlation-based approach, there
is another type of time-frequency representation which is motivated by the
time-frequency energy density. Recall that one way to compute the power spectral density
(PSD) is to take the Fourier transform of the auto-correlation function
as follows:

$$|S(f)|^2 = \int R(\tau)\,e^{-j2\pi f\tau}\,d\tau \tag{6.2}$$

where the auto-correlation function $R(\tau)$ is defined as:

$$R(\tau) = \int s(t + \tau/2)\,s^*(t - \tau/2)\,dt = \int R(t, \tau)\,dt \tag{6.3}$$

Note that $R(\tau)$ is actually the average of the instantaneous correlation $R(t, \tau)$ defined
as above. By averaging, we lose the time information. As a counterpart of the PSD,
we substitute $R(\tau)$ with $R(t, \tau)$, which leads to:

$$\mathrm{WVD}(t, f) = \int R(t, \tau)\,e^{-j2\pi f\tau}\,d\tau \tag{6.4}$$

Equation (6.4) is usually called the Wigner-Ville distribution (WVD) [58], [59].
Obviously, there is no window effect anymore. Compared to the STFT, the WVD offers a
better signal representation in the time-frequency domain. In addition to the high
resolution, the WVD possesses many desirable properties for signal analysis. However,
it is also recognized that the WVD suffers from a so-called cross-term interference
problem which prevents the WVD from being used for many practical applications.
Since the cross-term is almost always oscillatory, the most straightforward way of
removing the cross-term is through 2D lowpass filtering:

$$C(t, f) = \iint \phi(x, y)\,\mathrm{WVD}(t - x, f - y)\,dx\,dy \tag{6.5}$$

where $\phi(x, y)$ is a 2D lowpass filter. The equation can also be rewritten as:

$$C(t, f) = \iint \Phi(\nu, \tau)\,\mathrm{AF}(\nu, \tau)\,e^{j2\pi(\nu t - f\tau)}\,d\nu\,d\tau \tag{6.6}$$

where $\Phi(\nu, \tau)$ is the Fourier transform of $\phi(x, y)$ and AF is the ambiguity function
of the signal $s(t)$, defined as:

$$\mathrm{AF}(\nu, \tau) = \iint \mathrm{WVD}(t, f)\,e^{-j2\pi(\nu t - f\tau)}\,dt\,df = \int s(t + \tau/2)\,s^*(t - \tau/2)\,e^{-j2\pi\nu t}\,dt \tag{6.7}$$

From Equation (6.6), we see that the 2D convolution between the lowpass filter
and the WVD can also be achieved by multiplying the AF by a kernel
function. The family of distributions defined by this equation is commonly referred to as
Cohen's class [58], [59], and this formulation greatly facilitates the selection of a desired kernel.
Since different kernel functions determine different properties of the resulting
time-frequency representation of a given signal, kernel function selection is an
application-specific task. In the following subsection, we detail the kernel function selection
issue for our particular problem of plastic mine detection.
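
The cross-term interference that motivates the lowpass filtering in Equation (6.5) can be reproduced with a small discrete WVD. The sketch below is an illustration under stated assumptions, not the code used in this work: `wvd` is a hypothetical helper that sums the instantaneous correlation s[n+m] s*[n-m] over integer lags m, so that frequency bin k corresponds to k / (2 * n_freq) cycles per sample, and the two-tone test signal is chosen only to make the cross-term easy to locate.

```python
import cmath
import math

def wvd(s, n_freq):
    """Discrete Wigner-Ville distribution of a complex sequence (illustrative helper).

    Returns W[n][k]; bin k corresponds to k / (2 * n_freq) cycles per sample
    because the lag between s[n + m] and s[n - m] is 2m samples.
    """
    n_samples = len(s)
    out = []
    for n in range(n_samples):
        # Use only lags for which both samples exist, capped to keep bins unambiguous.
        max_lag = min(n, n_samples - 1 - n, n_freq // 2 - 1)
        row = []
        for k in range(n_freq):
            acc = 0j
            for m in range(-max_lag, max_lag + 1):
                r = s[n + m] * s[n - m].conjugate()      # instantaneous correlation
                acc += r * cmath.exp(-2j * math.pi * k * m / n_freq)
            row.append(acc.real)   # the sum is real because r(-m) = conj(r(m))
        out.append(row)
    return out

N, K = 64, 32
tone = lambda f: [cmath.exp(2j * math.pi * f * n) for n in range(N)]
s1 = tone(8 / 64)      # auto-term expected at bin 8
s2 = tone(24 / 64)     # auto-term expected at bin 24

W_one = wvd(s1, K)
W_two = wvd([a + b for a, b in zip(s1, s2)], K)

# Midway between the two tones (bin 16) the two-tone WVD contains a cross-term
# that oscillates along time: large at n = 32, almost absent at n = 33.
cross_strong = W_two[32][16]
cross_weak = W_two[33][16]
```

In this example the cross-term at bin 16 exceeds the auto-term at bin 8 for n = 32, although the signal has no energy at that frequency, and it nearly vanishes one sample later. This oscillation along time is what a 2D lowpass filter exploits.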

6.3.2 Choi-Williams Distribution

Before proceeding, let us take a look at a simple toy example which is closely
related to our application. Suppose we have a signal $s(t)$ composed of two signal
components: one is a sinusoid and the other is an impulse, i.e.,

$$s(t) = e^{j2\pi f_0 t} + \delta(t) \triangleq s_1(t) + s_2(t) \tag{6.8}$$

The ambiguity function of the signal is then given by:

$$\mathrm{AF}_s = \int s(t + \tau/2)\,s^*(t - \tau/2)\,e^{-j2\pi\nu t}\,dt = \mathrm{AF}_{s_1} + \mathrm{AF}_{s_2} + \mathrm{AF}_{s_1 s_2} + \mathrm{AF}_{s_2 s_1} \tag{6.9}$$

where $\mathrm{AF}_{s_1}$ and $\mathrm{AF}_{s_2}$ are the auto-ambiguity functions of the signals $s_1(t)$ and $s_2(t)$,
respectively, and $\mathrm{AF}_{s_1 s_2}$, $\mathrm{AF}_{s_2 s_1}$ are the cross-ambiguity functions between the signals
$s_1(t)$ and $s_2(t)$. From Equation (6.9), we see that the auto-ambiguity functions
are distributed only along the time-delay and Doppler-shift axes. This observation
suggests the use of the Choi-Williams distribution, since the corresponding kernel
function is given by:

$$\Phi(\nu, \tau) = e^{-\alpha(\nu\tau)^2} \tag{6.10}$$

It is easy to see that the kernel function preserves the information along both axes
while suppressing the cross-term away from the axes. The parameter $\alpha$ controls the
decay spread. It should be noted that since the kernel also retains the cross-term components
lying on the two axes, there are some horizontal and vertical ripples in the time-frequency domain.
The resulting Choi-Williams distribution for this toy example is shown in Figure 6.4.
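
The behaviour of the kernel on and off the axes is easy to verify numerically. The sketch below assumes the commonly used exponential form of the Choi-Williams kernel, exp(-alpha * (nu * tau)^2); the function names `choi_williams_kernel` and `apply_kernel` and the sample values are hypothetical and serve only as an illustration.

```python
import math

def choi_williams_kernel(nu, tau, alpha=1.0):
    """Assumed kernel form: exp(-alpha * (nu * tau)**2)."""
    return math.exp(-alpha * (nu * tau) ** 2)

def apply_kernel(af, nus, taus, alpha=1.0):
    """Weight a sampled ambiguity function af[i][j] = AF(nus[i], taus[j]) by the kernel."""
    return [
        [af[i][j] * choi_williams_kernel(nu, tau, alpha) for j, tau in enumerate(taus)]
        for i, nu in enumerate(nus)
    ]

# On either axis the product nu * tau is zero, so the kernel equals one and the
# auto-terms of the toy signal pass unchanged.
on_delay_axis = choi_williams_kernel(0.0, 5.0)
on_doppler_axis = choi_williams_kernel(3.0, 0.0)

# Away from both axes the kernel decays quickly; a larger alpha decays faster.
off_axis_strong = choi_williams_kernel(2.0, 2.0, alpha=1.0)
off_axis_mild = choi_williams_kernel(2.0, 2.0, alpha=0.01)

# Applying the kernel to a flat 3 x 3 ambiguity grid keeps the axes and damps the corners.
grid = apply_kernel([[1.0] * 3 for _ in range(3)], [-2.0, 0.0, 2.0], [-2.0, 0.0, 2.0])
```

Choosing alpha is a trade-off: a large value suppresses cross-terms more strongly but also narrows the region around the axes in which auto-term energy is preserved.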

6.3.3  Implementation  Considerations 

It is well known that for real-valued signals, the WVD suffers from an aliasing
problem because the period of the WVD is $\pi$ instead of $2\pi$. Even for complex-valued
signals, the aliasing problem still exists. Basically, there are two ways to deal with the
problem. One is to use analytic signals instead of the original signals [7]. Since analytic
signals are derived through the Hilbert transform to discard the negative frequency
components, however, it can be shown that using analytic signals alters the original WVD,
particularly in the low frequency band [58]. The other is to oversample the
original signals to avoid the aliasing, which is the approach we use herein.

Figure 6.4: The ambiguity function (AF) and the Choi-Williams distribution for the
toy example: (a) the AF and (b) the Choi-Williams distribution.

6.4  Time-Frequency  Patterns 

So far, we have developed the tools for our further analysis. The goal of
this section is twofold. First, by displaying the signal in the time-frequency domain
through a proper distribution, we wish to justify the claims we have made in the
Introduction. This can serve as a guide for our future work. Second, we present
several time-frequency patterns for different mine types, burial depths and
environments as well as for clutter. These images bear some interesting interpretations that
are useful for feature extraction.

Figure 6.5: A time-frequency distribution for a mine embedded in a noisy background.
The background shows significant nonstationarity. A mine is located around the time
instance of 0.03 μs.


6.4.1  Nonstationarity  of  the  Background 

Figure  6.5  shows  the  signal  of  a mine  contaminated  by  background  clutter  and 
the  corresponding  time-frequency  representation.  We  have  oversampled  the  signal 
and  the  sampling  frequency  is  12  GHz.  It  is  clear  from  the  figure  that  the  background 
signal  shows  significant  nonstationarity.  It  is  interesting  to  note  that  most  often  the 
energy  of  the  background  clutter  is  concentrated  around  the  high  frequency  band. 
This  may  explain  why  some  energy  based  detectors  (such  as  CFAR)  work  better  in 
the  lower  subband  images  from  0.3  GHz  to  2 GHz.  However,  it  is  also  clear  from  the 
figure  that  discarding  the  high  frequency  information  arbitrarily  also  risks  the  loss  of 
the  target  information. 

6.4.2  Mine  Patterns 

It is reported that landmines have a salient double-peak signature in the spatial
domain, which can be interpreted as the signals returned from the front and rear edges
of a mine [41]. This signature becomes even more evident in the time-frequency domain. In Figure
6.6 we present several time-frequency representations for two types of mines, denoted
as M1 and M2, both buried at a depth of 3 inches. The time domain signals, the
power spectral densities and the metal mine pictures are also presented. These TF
representations can be roughly interpreted as follows: the front and rear edges can
be modeled as scattering centers, with each edge serving as a discrete event in
time. In the time-frequency domain, such an event shows up as a vertical line in the image since
it occurs at a particular time instance but over all frequencies [13]. The edges can be
used as a salient feature for discriminating mines from clutter and possibly even
for discriminating among different types of mines. Another feature in the
time-frequency domain is the existence of resonances. A resonance can be modeled as
a discrete event in frequency that becomes prominent at a particular frequency and
shows up in the time-frequency domain as a horizontal line. For comparison, we also
present two clutter image chips and the corresponding TF representations in Figure
6.7. It should be noted that these two samples are by no means typical examples
of clutter.

Figure 6.6: (a) Mine picture, the corresponding radar image chip and the
time-frequency representation for the M1 mine buried at a depth of 3 inches; (b) Mine
picture, the corresponding radar image chip and the time-frequency representation
for the M2 mine buried at a depth of 3 inches.

Figure 6.7: (a) Smashed coke can picture, the corresponding radar image chip and
the time-frequency representation for the can buried at a depth of 3 inches; (b)
Wood plate picture, the corresponding radar image chip and the time-frequency
representation for the wood plate buried at a depth of 3 inches.


Through the time-frequency analysis, we obtain a graphical understanding of
how a mine reacts to incident radar signals. A question arises naturally: how can we
extract this time-frequency localized information efficiently for signal classification?
We may not be able to use the Choi-Williams distribution (CWD) directly, since after the
CWD transform a TF image contains too much redundant information as well as cross-terms.
One widely used TF technique for signal classification purposes is the discrete wavelet transform
(DWT). An apparent drawback of the DWT is its poor frequency resolution
in the high frequency range and poor time resolution in the low frequency range.
When used for classification applications, it may have difficulty identifying the desired
discriminant features in the needed ranges. The wavelet packet transform (WPT), on the other
hand, uses a rich library of redundant bases with arbitrary TF resolution [14];
therefore, compared with the DWT, the WPT is more suitable for extracting features from
signals having nonstationary as well as stationary components. We give detailed
discussions below on how the wavelet packet transform can be used to extract the
information, based on which a classifier can be designed.

6.5  Landmine  Classifier  Design 
6.5.1  Wavelet  Packet  Transform 


The wavelet packet transform, which is an important generalization of the DWT,
provides a much more flexible signal decomposition scheme than the DWT. As with the DWT,
wavelet packet basis functions are formed by scaling and translating a family of
basis functions:

$$w_{j,b,k}(t) = 2^{-j/2}\,w_b(2^{-j}t - k), \quad j, k \in \mathbb{Z} \tag{6.11}$$

where $\mathbb{Z}$ is the set of all integers. However, for the WPT, in addition to the scaling
parameter $j$ and translation parameter $k$, there is also an oscillation parameter $b$,
with a larger $b$ corresponding to a higher frequency. A father wavelet and a
mother wavelet correspond to $w_b$ with $b$ equal to 0 and 1, respectively: $w_0(t) = \phi(t)$,
$w_1(t) = \psi(t)$. The rest of the wavelet packet functions $w_b(t)$, $b = 2, 3, \ldots$, are
defined as:

$$w_b(t) = \sqrt{2}\sum_k f_b(k)\,w_{\lfloor b/2 \rfloor}(2t - k) \tag{6.12}$$

where $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$. The filter $f_b(k)$ is either
a lowpass or a highpass filter depending on the value of $b$ [58]:

$$f_b(k) = \begin{cases} g(k) & \text{if } b \bmod 4 = 0 \text{ or } 3 \\ h(k) & \text{if } b \bmod 4 = 1 \text{ or } 2 \end{cases} \tag{6.13}$$

where $g(k)$ and $h(k)$ are the lowpass and highpass quadrature mirror filters (QMF)
associated with the mother wavelet function. Using the wavelet basis functions, the
WPT coefficients can then be calculated as the inner product between a signal and
the corresponding wavelet basis function:

$$\mathrm{WPT}(j, b, k) = \int s(t)\,w_{j,b,k}(t)\,dt \tag{6.14}$$

The wavelet packet decomposition scheme may be better understood with the aid
of the two-channel subband coding scheme which is used to implement the DWT.
Compared with the DWT, where at each level only the lower halfband signal is further
decomposed, the WPT decomposes the higher halfband signal as well (Figure 6.8(a)).
If we retain all of the transform coefficients and stack them together in the order
of the level, a wavelet packet table is constructed. Suppose we have a signal $s(n)$
of length $L$, with $L$ being a multiple of $2^J$. The wavelet packet table then has
$J + 1$ levels, where $J$ is the maximum possible resolution level. At resolution
level $j$, the table has $L$ coefficients, divided into $2^j$ coefficient blocks indexed by $j$
and $b$, each usually called a node: $\mathbf{w}_{j,b} = [\,w_{j,b,1}\ w_{j,b,2}\ \cdots\ w_{j,b,L/2^j}\,]$. Figure
6.8(b) shows a layout of a wavelet packet table with 3 resolution levels. Level
0 corresponds to the original signal. We see that after the WPT, a signal of length $L$
ends up with a maximum of $J \times L$ coefficients, indicating that the WPT is an overcomplete
transform. Starting from this table, a particular set of coefficients can be selected to
form a complete and orthogonal transformation; one example is the DWT, obtained by retaining
the coefficients in the nodes $\mathbf{w}_{1,1}$, $\mathbf{w}_{2,1}$, $\mathbf{w}_{3,0}$ and $\mathbf{w}_{3,1}$. In general, the selection
of the bases is usually accompanied by the optimization of a certain cost function,
known as the best basis method [14], [66], [25]. For signal classification, however, we can
argue that there is no need for the sought bases to be complete and orthogonal. We
only need to determine the components that most efficiently encode the discriminant
information among signal classes. In this way, the best basis selection process can be
cast directly as a feature selection problem.
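
The structure of the wavelet packet table can be made concrete with a short two-channel filter-bank sketch. This is an illustration only: it assumes the two-tap Haar filter pair for brevity, the helper names `wavelet_packet_table` and `squared_features` are hypothetical, and the nodes are stored in filter-bank (natural) order rather than in the frequency order implied by Equation (6.13).

```python
import math

def wavelet_packet_table(signal, levels):
    """Return table[j], a list of 2**j nodes with len(signal) / 2**j coefficients each.

    Haar filters are an assumption made to keep the sketch short; every node,
    lowpass or highpass, is split again at the next level.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    r = 1.0 / math.sqrt(2.0)
    table = [[list(signal)]]                     # level 0 is the original signal
    for _ in range(levels):
        next_level = []
        for node in table[-1]:
            low = [r * (node[i] + node[i + 1]) for i in range(0, len(node), 2)]
            high = [r * (node[i] - node[i + 1]) for i in range(0, len(node), 2)]
            next_level.extend([low, high])       # both halfbands are kept
        table.append(next_level)
    return table

def squared_features(table):
    """Full feature set: squared coefficients of levels 1..J (J * L values)."""
    return [c * c for level in table[1:] for node in level for c in node]

example = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]   # L = 8, J = 3
table = wavelet_packet_table(example, 3)
features = squared_features(table)
```

For this L = 8 example the table has J + 1 = 4 levels, each level holds L coefficients split into 2^j nodes, and the full feature set has J x L = 24 entries. Because the Haar pair is orthonormal, every level carries the same total energy as the original signal.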

Before proceeding, we present here a toy example to show how the WPT can
be used to extract localized features. We construct the full feature set from the
wavelet packet table by squaring the WPT coefficients. This feature has an intuitive physical
meaning. Since each wavelet basis function has its corresponding coverage in the TF
plane, it can be viewed as a “window” through which we observe a signal. The value
of a feature is then the energy of the signal seen through that “window”. The way that
the wavelet basis functions interpret a signal is analogous to the way we humans
read a signal’s TF image. To make this clearer, let us revisit the problem of
extracting time-frequency localized information posed in the previous section. If we
put some adjustable “windows” on the image, we can easily describe the contents
of the image with only a few WPT coefficients (Figure 6.9). Of course, the size of
a window should comply with the uncertainty principle. Note that some windows
overlap, which is impossible in the best basis method, but the overlap makes the
description of the information more efficient. The above example is only an illustration.
The selection of the windows or bases should be determined by the signal classes. For
example, if the mines and clutter have the same resonant component, that component
will not be selected.

Figure 6.8: (a) Wavelet packet transform with h(n) and g(n) being a pair of QMF.
(b) Wavelet packet table. Each node is indexed by the corresponding wavelet packet
coefficients.

For the convenience of the discussions below, we denote the training data set
by $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \in \mathbb{R}^h \times \{\pm 1\}$, with a label $+1$ being a mine and a label $-1$
being a clutter sample, and let $\phi : \mathbb{R}^h \to \mathbb{R}^l$, with $h > l$, be a feature selector which
transforms data from a high-dimensional full feature space into a low-dimensional
feature subspace. Where no confusion arises, we use a vector $\mathbf{x}$ to denote a mine
sample as well as the corresponding WPT pattern.

6.5.2  Feature  Selection 

The full feature set constructed from the wavelet packet table usually has a high
dimensionality relative to the training sample size. Since most signal classification
systems learn the system parameters in a low dimensional space, reducing
the data dimensionality while selecting the most salient features becomes critically
important. Basically there are two methods, feature selection and feature extraction
[36], that can be employed to address the above issue. In this chapter, we only focus
on the methods of feature selection for reducing the data dimensionality.


Figure 6.9: An illustration of using the wavelet windows to extract localized
information contained in a TF image. The size of a window should comply with the
uncertainty principle.

The problem of feature selection is defined as follows: given a feature set $\mathcal{X}$
of size $h$, let $\mathcal{S} = \{\mathcal{Y} : \mathcal{Y} \subseteq \mathcal{X}, |\mathcal{Y}| = l\}$ with $l < h$ (if possible, $l \ll h$), and denote by
$\mathcal{D}(\mathcal{Y}) = \{(\mathbf{x}_n^{\mathcal{Y}}, y_n)\}_{n=1}^{N} \in \mathbb{R}^l \times \{\pm 1\}$ a training data set constructed from the feature
subset $\mathcal{Y}$. A feature selection algorithm finds a subset $\mathcal{Y}^* \in \mathcal{S}$ such that a cost function
$J(\mathcal{D}(\mathcal{Y}^*))$ is optimized, i.e.,

$$\mathcal{Y}^* = \arg\max_{\mathcal{Y} \in \mathcal{S}} J(\mathcal{D}(\mathcal{Y})) \tag{6.15}$$

Without loss of generality, we assume that the larger the cost function, the better the
subset. For simplicity, in the following discussions we use $J(\mathcal{Y})$ instead of $J(\mathcal{D}(\mathcal{Y}))$.
Once a suitable cost function has been chosen to evaluate the goodness of a
candidate feature subset, the feature selection problem reduces to a search problem.
Although an exhaustive search is guaranteed to reach the optimal subset, it requires
examining $\binom{h}{l}$ possible candidate subsets and consequently is computationally
prohibitive even for moderate values of $h$ and $l$. The only optimal search strategy which
avoids an exhaustive search is the Branch and Bound (BB) method [30]. However,
it requires the cost function to be monotonic. Though the monotonicity condition is
not particularly restrictive, the BB method does not work well for a large scale
problem (when d > 30 as defined in [56]). Therefore, in our case we must resort to some
computationally feasible strategies that avoid an exhaustive search but might lead
to a suboptimal solution. One example is the Sequential Forward Selection (SFS)
method. It starts from an empty set and sequentially adds, one at a time, the feature
which, when combined with the already selected features, maximizes the cost function,
until a predefined number of features is attained. The main drawback of SFS is that it is
unable to remove a feature once it has been retained, even if that feature becomes obsolete
after the inclusion of other features; this is the so-called nesting effect. A more sophisticated search
strategy is the Sequential Floating Forward Selection (SFFS) [57], which attempts to
overcome the nesting problem of SFS through flexible backtracking. The algorithm
is summarized as follows:


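Sequential Floating Forward Selection

Initialization: full feature set $\mathcal{X}$, $\mathcal{Y}_0 = \emptyset$, predefined feature number $l$, $k = 0$
while $k < l$
    $x^+ = \arg\max_{x \in \mathcal{X} - \mathcal{Y}_k} J(\mathcal{Y}_k + \{x\})$
    $\mathcal{Y}_{k+1} = \mathcal{Y}_k + \{x^+\}$; $k = k + 1$
    if $k > 2$
        $x^- = \arg\max_{x \in \mathcal{Y}_k} J(\mathcal{Y}_k - \{x\})$
        while $J(\mathcal{Y}_k - \{x^-\}) > J(\mathcal{Y}_{k-1})$ and $k > 2$
            $\mathcal{Y}_{k-1} = \mathcal{Y}_k - \{x^-\}$; $k = k - 1$
            if $k > 2$, $x^- = \arg\max_{x \in \mathcal{Y}_k} J(\mathcal{Y}_k - \{x\})$, end
        end
    end
end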


Although SFFS still does not guarantee the optimality of the selected features,
it is reported that the floating search of SFFS is able to provide a close-to-optimal solution
[57].
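
The floating search can be sketched in a few lines of Python. This is a hedged illustration rather than the implementation used in the experiments: the function `sffs`, its bookkeeping of the best cost seen for each subset size, and the hand-made cost table `TOY_COSTS` are assumptions introduced here to show how backtracking escapes the nesting effect.

```python
def sffs(n_features, cost, target_size):
    """Sequential floating forward selection over feature indices 0..n_features-1.

    `cost` maps a list of feature indices to a score (larger is better).
    """
    selected = []        # current subset
    best_cost = {}       # best score found so far for each subset size
    while len(selected) < target_size:
        # Forward step: add the feature that maximizes the cost.
        candidates = [f for f in range(n_features) if f not in selected]
        add = max(candidates, key=lambda f: cost(selected + [f]))
        selected = selected + [add]
        k = len(selected)
        best_cost[k] = max(best_cost.get(k, float("-inf")), cost(selected))
        # Floating backward steps: drop the least useful feature as long as the
        # reduced subset beats the best subset of that size found so far.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: cost([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if cost(reduced) > best_cost[len(reduced)]:
                selected = reduced
                best_cost[len(reduced)] = cost(reduced)
            else:
                break
    return selected

# Hand-made cost table (hypothetical values): feature 0 is the best single
# feature, but the best pair {1, 2} and the best triple {1, 2, 3} exclude it.
TOY_COSTS = {
    frozenset({0}): 5.0, frozenset({1}): 3.0, frozenset({2}): 3.0, frozenset({3}): 1.0,
    frozenset({0, 1}): 6.0, frozenset({0, 2}): 6.0, frozenset({0, 3}): 5.5,
    frozenset({1, 2}): 8.0, frozenset({1, 3}): 4.0, frozenset({2, 3}): 4.0,
    frozenset({0, 1, 2}): 9.0, frozenset({0, 1, 3}): 7.0,
    frozenset({0, 2, 3}): 7.0, frozenset({1, 2, 3}): 10.0,
}

def toy_cost(subset):
    return TOY_COSTS[frozenset(subset)]

chosen = sffs(4, toy_cost, 3)
```

On this toy table a purely greedy forward search would keep feature 0 and end with {0, 1, 2}, whereas the floating search removes feature 0 after it becomes obsolete and ends with {1, 2, 3}, the subset with the highest cost.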

Concerning the cost functions, in this chapter we consider using a cost function
based on linear discriminant analysis (LDA). Suppose we have a data set
$\mathcal{D}(\mathcal{Y}) = \{(\mathbf{x}_n^{\mathcal{Y}}, y_n)\}_{n=1}^{N} \in \mathbb{R}^l \times \{\pm 1\}$. Then the within-class scatter matrix and the
between-class scatter matrix are defined as follows [23]:

$$\mathbf{S}_w = \sum_{\{n:\,y_n=+1\}} (\mathbf{x}_n^{\mathcal{Y}} - \mathbf{m}_+)(\mathbf{x}_n^{\mathcal{Y}} - \mathbf{m}_+)^T + \sum_{\{n:\,y_n=-1\}} (\mathbf{x}_n^{\mathcal{Y}} - \mathbf{m}_-)(\mathbf{x}_n^{\mathcal{Y}} - \mathbf{m}_-)^T \tag{6.16}$$

$$\mathbf{S}_b = (\mathbf{m}_+ - \mathbf{m}_-)(\mathbf{m}_+ - \mathbf{m}_-)^T \tag{6.17}$$

where $\mathbf{m}_+$ and $\mathbf{m}_-$ are the mean vectors of the mine samples and the clutter samples,
respectively. To achieve good class separability, we need to find a subset such that
the within-class scatter is small while the between-class scatter is large. One possible
scalar measure using the trace of a matrix is:

$$J = \mathrm{tr}(\mathbf{S}_w^{-1}\mathbf{S}_b) \tag{6.18}$$

Compared with the wrapper method, which uses the classification performance of a
certain classifier as the selection criterion, the LDA cost function has a low
computational complexity. However, it only exploits second-order statistical
information and attempts to select features with unimodal distributions. Hence,
even if the exhaustive search method is used, the resulting feature set may still
be suboptimal. Furthermore, the problem of how to choose the number of features is
still open. These problems can be mitigated when the boosting algorithm is used to
select features adaptively.
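
For the two-class case, the trace criterion of Equation (6.18) simplifies because the between-class scatter has rank one: tr(S_w^{-1} S_b) equals (m_+ - m_-)^T S_w^{-1} (m_+ - m_-). The sketch below uses this identity with a small Gaussian-elimination solver in pure Python; the helper names `solve` and `lda_cost` and the four-sample data sets are assumptions for illustration only.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("within-class scatter matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def lda_cost(samples, labels):
    """J = tr(S_w^{-1} S_b), computed as d^T S_w^{-1} d with d = m_plus - m_minus."""
    dim = len(samples[0])
    groups = {+1: [], -1: []}
    for x, y in zip(samples, labels):
        groups[y].append(x)
    means = {y: [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
             for y, rows in groups.items()}
    s_w = [[0.0] * dim for _ in range(dim)]
    for y, rows in groups.items():
        for r in rows:
            c = [r[i] - means[y][i] for i in range(dim)]
            for i in range(dim):
                for j in range(dim):
                    s_w[i][j] += c[i] * c[j]             # within-class scatter
    d = [p - q for p, q in zip(means[+1], means[-1])]
    return sum(di * zi for di, zi in zip(d, solve(s_w, d)))

labels = [+1, +1, -1, -1]
j_one = lda_cost([[1.0], [3.0], [-1.0], [-3.0]], labels)
# A second feature with identical class means adds no separability.
j_two = lda_cost([[1.0, 1.0], [3.0, -1.0], [-1.0, 1.0], [-3.0, -1.0]], labels)
```

In the one-dimensional example the class means differ by 4 and the within-class scatter is 4, so the cost is 16 / 4 = 4. Appending the uninformative second feature leaves the cost at 4, which is the behaviour a selection criterion should have.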

6.5.3 Neural Network

With the selected features, a neural network (NNW) classifier is designed.
For a reason which will become clear in the next subsection, we train the network by
minimizing the cross-entropy error function, which is given as follows:

$$E = -\sum_{n=1}^{N}\left[\frac{1 + y_n}{2}\ln z_n + \frac{1 - y_n}{2}\ln(1 - z_n)\right] \tag{6.19}$$

where $y_n \in \{\pm 1\}$ and $z_n \in [0, 1]$ are the target value and the output of the network,
respectively, corresponding to the input $\mathbf{x}_n$. With the output activation function
chosen to be the logistic function:

$$z = \frac{1}{1 + e^{-a}} \tag{6.20}$$

where $a$ denotes the input to the output unit, it can be shown that the output $z_n$ of the
network is an estimate of the posterior probability of $\mathbf{x}_n$ belonging to the mine class,
i.e., $z_n = P(y = +1 \mid \mathbf{x}_n)$ [5].
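
A small numerical sketch may help fix the roles of the target and the output. It assumes the usual binary cross-entropy, with a target of +1 contributing -ln(z) and a target of -1 contributing -ln(1 - z); the function names and the clipping constant `eps` are assumptions added for numerical safety and illustration.

```python
import math

def logistic(a):
    """Logistic output activation, mapping any real input to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(targets, outputs, eps=1e-12):
    """Binary cross-entropy for targets in {+1, -1} and outputs in [0, 1].

    Outputs are clipped to [eps, 1 - eps] so that the logarithm stays finite.
    """
    total = 0.0
    for y, z in zip(targets, outputs):
        z = min(max(z, eps), 1.0 - eps)
        total -= math.log(z) if y == +1 else math.log(1.0 - z)
    return total

# An undecided network (z = 0.5 everywhere) costs ln 2 per sample.
undecided = cross_entropy([+1, -1], [0.5, 0.5])
# Confident, correct outputs give a lower error than hesitant ones.
confident = cross_entropy([+1, -1], [0.9, 0.1])
hesitant = cross_entropy([+1, -1], [0.6, 0.4])
```

Because the output is read as a posterior probability, one natural way to obtain a soft decision in [-1, 1] is the mapping 2z - 1; that mapping is an assumption here, not something stated in the text.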

Although the NNW is capable of providing a nonlinear mapping for the training
data, in the training stage it is prone to being stuck in local minima, and thus reaching
the global minimum is not guaranteed. Moreover, the problem of finding the optimum
NNW structure, i.e., specifying the numbers of hidden units and hidden layers, is not
trivial. It is usually a process of trial and error, and a large amount of data is needed
to support this process. Although we have a large amount of clutter data, collecting
mine samples is expensive. Therefore, in this chapter, our strategy is to design a
relatively simple classifier, specifically an NNW classifier with a simple structure, and
then use the boosting method to transform the weak learner into a strong learner.

6.5.4 Boosting

The original AdaBoost [27] uses $\{\pm 1\}$-valued classification functions, i.e.,
$h_t(\mathbf{x}) : \mathbf{x} \to \{\pm 1\}$, as the base learners. Schapire et al. [69] extended it to a more
general version which uses real-valued functions as the base learners, that is,
$h_t(\mathbf{x}) : \mathbf{x} \to [-1, 1]$, with $\mathrm{sgn}(h_t(\mathbf{x}))$ being the class label and $|h_t(\mathbf{x})|$ the classification
confidence. The pseudo-code of AdaBoost using soft decisions is presented as follows:


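AdaBoost Using Soft Decision

Initialization: $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \in \mathbb{R}^l \times \{\pm 1\}$, maximum iteration number $T$, $d^{(1)}(n) = 1/N$
for $t = 1 : T$
    1. Train the weak learner with respect to the distribution $d^{(t)}$ and get the
       hypothesis $h_t(\mathbf{x}) : \mathbf{x} \to [-1, 1]$.
    2. Calculate the weighted margin: $r_t = \sum_{n=1}^{N} d^{(t)}(n)\,y_n h_t(\mathbf{x}_n)$
    3. Set $\alpha_t = \frac{1}{2}\ln\left(\frac{1 + r_t}{1 - r_t}\right)$
    4. Update weights:

       $$d^{(t+1)}(n) = d^{(t)}(n)\exp(-\alpha_t y_n h_t(\mathbf{x}_n)) / C_t \tag{6.21}$$

       where $C_t$ is the normalization constant such that $\sum_{n=1}^{N} d^{(t+1)}(n) = 1$.
end
Output: $f(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})$
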
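
The loop above can be sketched in pure Python. The sketch is illustrative and makes several assumptions that are not part of the described system: the weak learner is replaced by a search over a small pool of smooth threshold functions (`make_pool`, using tanh, closed under negation) instead of a neural network, the data are a one-dimensional toy problem, and the margin is clamped slightly below one to keep the logarithm finite.

```python
import math

def make_pool(thresholds, slope=4.0):
    """Pool of soft threshold hypotheses with values in [-1, 1], closed under negation."""
    pool = []
    for theta in thresholds:
        for sign in (+1.0, -1.0):
            pool.append(lambda x, s=sign, th=theta: s * math.tanh(slope * (x - th)))
    return pool

def weighted_margin(h, xs, ys, d):
    """r = sum_n d(n) y_n h(x_n)."""
    return sum(w * y * h(x) for w, x, y in zip(d, xs, ys))

def adaboost_soft(xs, ys, pool, rounds):
    """AdaBoost with real-valued (soft) base decisions in [-1, 1]."""
    n = len(xs)
    d = [1.0 / n] * n                       # uniform initial distribution
    ensemble = []
    for _ in range(rounds):
        # "Weak learner": pick the pool member with the largest weighted margin.
        h = max(pool, key=lambda g: weighted_margin(g, xs, ys, d))
        r = min(weighted_margin(h, xs, ys, d), 1.0 - 1e-12)
        if r <= 0.0:
            break                           # no hypothesis better than chance
        alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))
        ensemble.append((alpha, h))
        # Re-weight: poorly classified samples gain weight, then normalize.
        d = [w * math.exp(-alpha * y * h(x)) for w, x, y in zip(d, xs, ys)]
        total = sum(d)
        d = [w / total for w in d]
    return ensemble

def predict(ensemble, x):
    """Soft output of the ensemble; its sign is the predicted label."""
    return sum(alpha * h(x) for alpha, h in ensemble)

# Toy problem: the positive class occupies an interval, so no single
# threshold function separates the classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [-1, -1, +1, +1, +1, -1, -1]
pool = make_pool([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])
ensemble = adaboost_soft(xs, ys, pool, rounds=50)
```

Each member of the pool misclassifies at least two of the seven samples, yet the weighted combination classifies all of them correctly. This is the weak-to-strong transformation that the chapter relies on.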


We use a neural network classifier as the weak learner to estimate the posterior
probability of the training samples, from which the weighted margin is calculated.
One problem associated with the neural network is the possibility of the network
being trapped in local minima. Without boosting, one remedy
suggested in the literature is to re-train the network. Here, however, as long as the
weighted margin $r_t > 0$, the cost function always decreases. Moreover, with
the classifier class $\mathcal{H}$ being closed under negation, that is, $h \in \mathcal{H}$ implies $-h \in \mathcal{H}$, even
in the worst case where $r_t < 0$, the cost function can still be reduced by choosing the classifier
to be $-h_t(\mathbf{x})$ instead of $h_t(\mathbf{x})$. Therefore, due to the iterative nature of
AdaBoost, the neural network being trapped in local minima is not a problem but
only changes the convergence rate of the cost function.

An intuitive idea in AdaBoost is that the examples which are misclassified
receive more weight in the next iterations (Eq. (6.21)), and hence the subsequent
classifiers focus more and more on those harder cases, for instance, the samples near the
decision boundary. In other words, the subsequent classifiers attempt to modify the
decision boundary locally in the data space. In the original AdaBoost, the algorithm
trains an ensemble of classifiers based on the training data with different
distributions but with the same features. In our case, we reduce the data dimensionality
through feature selection in order to avoid the curse of dimensionality. However, the
features so produced are aimed at optimizing a certain cost function based on the
entire training data set rather than favoring part of the data, and thus may
not be able to provide sufficient discriminant information for these harder samples.
This observation motivates us to introduce the idea of re-extracting features
adaptively based on the misclassified samples before entering the next iteration. This
process is depicted in Figure 6.10. As the iterations proceed, one may expect the ensemble
classifier to overfit the training data eventually. The regularized AdaBoost [63], [19]
can be used to alleviate the problem. However, in our experiment, due to the good
overfitting resistance of AdaBoost, the ensemble classifier does not show apparent
overfitting even after 200 iterations. Shown in Figure 6.11 is the resulting
ensemble neural network classifier. Note that due to the feature selection process,
most of the connecting weights between the input vectors and each neural network
are set to zero. It can be considered as an ensemble of experts making decisions
based on different sets of features.

Figure 6.10: The training process of the ensemble neural network classifier using
AdaBoost.

Figure 6.11: The resulting ensemble neural network classifier. Due to the feature
selection process, most of the connecting weights between the input vectors and each
neural network are set to zero.

6.6 Experimental Results

To demonstrate the effectiveness of the proposed classifier, experimental
results based on the measured FLGPR data are presented. We collect a total of 133
mine chips and 3962 clutter chips, of which 113 mine chips and 3462 clutter chips are randomly
selected as the training data and the rest as the testing data. Note that the training
data for the two classes are highly unbalanced. We hence take five down-range profile
signals through the center of each mine chip to augment the mine data set, which
leads to 565 and 100 samples in the mine training set and testing set, respectively. A
Daublet 10 wavelet filter is used to decompose the signals into the WPT coefficients,
and then SFFS with the LDA cost function selects 10 features from the WPT table. With
the selected features, a multilayer neural network classifier is designed. The structure
of the network is quite simple: it has 10 input units, 5 hidden units and 1 output unit.
The sigmoid function and the logistic function are used as the activation functions in the
hidden layer and output layer, respectively. We train the network to minimize the
cross-entropy error function. From the network outputs, a weighted margin is
calculated and the training data distribution is updated. Before entering the next
iteration, a new set of features is re-extracted based on the training data with the
updated distribution. The above procedure is iterated until the maximum iteration
number is reached. The final decision is calculated as the weighted combination of the
decisions of the base learners. We conduct 10 experiments in total. The 10 training
and testing results are averaged and plotted in Figure 6.12. As expected, the
training errors decrease continuously as the iteration number increases and
reach zero when about 50 classifiers are included in the ensemble classifier. In general,
one may not be interested in driving the empirical error to zero due to overfitting
concerns. However, as we can see in Figure 6.12, the ensemble classifier presents a
very  impressive  generalization  capability.  With  the  inclusion  of  more  classifiers,  the 
receiver operating characteristic (ROC) curve of the testing results is continuously
pushed toward the upper-left corner and saturates when the number of ensemble
members reaches 80. Interestingly, the ensemble classifier does not show apparent
overfitting even when 200 classifiers are combined.

Figure 6.12: The training and testing results for the ensemble classifier with the
feature selection being integrated into each iteration: (a) training result, (b) testing
result. The numbers in the legend are the iteration numbers. The first iteration
corresponds to the classifier without AdaBoost.
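
The boosting loop described in this section, in which the features are re-selected under the updated distribution before every round, can be sketched roughly as follows. This is only an illustrative sketch and not the implementation used in the experiments: a weighted Fisher ratio stands in for SFFS with the LDA cost function, a decision stump stands in for the 10-5-1 neural network, and the data are synthetic. All function names (`fisher_scores`, `fit_stump`, `boost`, `decision`) are hypothetical.

```python
import numpy as np

def fisher_scores(X, y, w):
    """Weighted Fisher ratio per feature (assumed stand-in for SFFS + LDA cost)."""
    pos, neg = y == 1, y == -1
    wp, wn = w[pos] / w[pos].sum(), w[neg] / w[neg].sum()
    mp, mn = wp @ X[pos], wn @ X[neg]          # weighted class means
    vp = wp @ (X[pos] - mp) ** 2               # weighted class variances
    vn = wn @ (X[neg] - mn) ** 2
    return (mp - mn) ** 2 / (vp + vn + 1e-12)

def stump_predict(X, stump):
    j, thr, sign = stump
    return sign * np.where(X[:, j] > thr, 1, -1)

def fit_stump(X, y, w, feats):
    """Best weighted decision stump restricted to the selected features
    (assumed stand-in for the neural network base learner)."""
    best_err, best = np.inf, None
    for j in feats:
        for thr in np.unique(np.quantile(X[:, j], np.linspace(0.05, 0.95, 19))):
            for sign in (1, -1):
                err = w[stump_predict(X, (j, thr, sign)) != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best_err, best

def boost(X, y, rounds=20, k=3, reselect=True):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initial uniform distribution
    feats = np.argsort(fisher_scores(X, y, w))[-k:]
    ensemble = []
    for _ in range(rounds):
        if reselect:                           # re-extract features under the updated distribution
            feats = np.argsort(fisher_scores(X, y, w))[-k:]
        err, stump = fit_stump(X, y, w, feats)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # standard AdaBoost coefficient
        w = w * np.exp(-alpha * y * stump_predict(X, stump))
        w /= w.sum()                           # renormalize the distribution
        ensemble.append((alpha, stump))
    return ensemble

def decision(X, ensemble):
    """Weighted combination of the base learner decisions."""
    return sum(a * stump_predict(X, s) for a, s in ensemble)

# Synthetic data: only the first two of eight features carry class information.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=300)
X = rng.normal(size=(300, 8))
X[:, :2] += 1.5 * y[:, None]
ensemble = boost(X, y, rounds=20, k=3)
train_acc = np.mean(np.sign(decision(X, ensemble)) == y)
```

Calling `boost(..., reselect=False)` keeps the initially selected features fixed, which corresponds to the ensemble without adaptive feature selection that is used for comparison.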

For comparison, the results for the ensemble classifier without a feature
selection module integrated into each iteration are also presented (Figure 6.13). That
is, during the boosting procedure, we do not adaptively re-extract features for the
misclassified training samples. Again, the training errors are reduced to zero as the
iteration number increases, and the testing results show that the ensemble classifier
has a good generalization capability, which, however, is much worse than that of the
ensemble classifier with adaptive feature selection (Figure 6.14). This indicates that
the AdaBoost algorithm with integrated feature selection effectively extracts the
discriminant information and at the same time controls the side effect of overfitting.
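
For readers who want to reproduce this kind of comparison, the ROC curve of a detector can be computed from its output scores as sketched below. This is a minimal illustration under stated assumptions, not the evaluation code used here: labels are taken to be 1 for mine and 0 for clutter, both classes are assumed to be present, tied scores are treated as separate thresholds, and the function names `roc_points` and `auc` are hypothetical.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (false-alarm rate, detection rate) arrays, one point per threshold.
    Higher score means more mine-like; labels are 1 (mine) or 0 (clutter)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)[np.argsort(-scores, kind="stable")]
    tp = np.cumsum(labels == 1)                # detections as the threshold is lowered
    fp = np.cumsum(labels == 0)                # false alarms as the threshold is lowered
    pd = np.concatenate(([0.0], tp / tp[-1]))
    pfa = np.concatenate(([0.0], fp / fp[-1]))
    return pfa, pd

def auc(pfa, pd):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(pfa) * (pd[1:] + pd[:-1]) / 2.0))

# A detector that ranks every mine above every clutter sample, and its reverse.
pfa_good, pd_good = roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
pfa_bad, pd_bad = roc_points([0.9, 0.8, 0.7, 0.6], [0, 0, 1, 1])
```

A curve closer to the upper-left corner has a larger area under it, so the area gives a single number for comparing the two ensembles.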

Figure 6.13: The training and testing results for the ensemble classifier without the
feature selection being integrated into each iteration. The numbers in the legend
are the iteration numbers. The first iteration corresponds to the classifier without
AdaBoost.

Figure 6.14: Testing result (200 iterations). The ensemble classifier with adaptive
feature selection can improve the testing performance significantly over the classifier
without adaptive feature selection.

6.7 Conclusions

In this chapter, we have developed a landmine detector by using the wavelet
packet transform and a machine learning approach. Through the time-frequency
analysis, we have found that most of the discriminant information between signal
classes is time-frequency localized. This observation has motivated us to use the
wavelet packet transform to sparsely represent the signals, with the discriminant
information encoded into several bases. The SFFS with the LDA cost function has
been used to extract these components. With the extracted features, a neural network
classifier has been designed. In order to further improve the classification performance
and deal with the problem of unlimited possibilities of clutter, the AdaBoost algorithm
has been used. We have introduced the idea of integrating the feature selection module
into each iteration of the AdaBoost algorithm. The experimental results have shown
that with the proposed classifier, significant improvement in both the training and
testing performances can be achieved. The extension of our classification scheme to
general object recognition problems is possible.


CHAPTER  7 

CONCLUSIONS  AND  FUTURE  WORK 

7.1  Conclusions 

In the first part of this work, we have made a detailed study of AdaBoost and
its regularized variations. Motivated by the connection between AdaBoost and linear
programming, we have proposed several regularized boosting algorithms. These
algorithms implement an intuitive idea of controlling the distribution skewness in
the learning process to prevent outlier samples from spoiling decision boundaries
by introducing a smooth convex penalty function into the objective of the minimax
problem. Large-scale experiments based on the UCI, DELVE, STATLOG and USPS
datasets have been performed. Our regularized boosting algorithms can effectively
alleviate the overfitting problem of AdaBoost. Compared with other existing
regularized boosting algorithms, our methods achieve the same or much better
performance. The results on the USPS dataset have also demonstrated that
our algorithms are very robust against class mislabeling and feature noise. We have
also extended our derivations to multiclass problems and have proposed several
regularized multiclass boosting algorithms.
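
As a rough reminder of what this penalty looks like, the regularized minimax problem can be written schematically as below. This is a generic sketch, not the exact formulation derived in the earlier chapters: $Z$ is the gain matrix, $d$ the query distribution, $\alpha$ the combination coefficients and $\mathcal{P}^N$ the distribution simplex, as in the notation list of Appendix A, while the penalty $P$ and the trade-off parameter $\beta$ are placeholder symbols assumed here for illustration.

```latex
\max_{\alpha \ge 0,\; \|\alpha\|_1 = 1} \;\; \min_{d \in \mathcal{P}^N} \;\;
    d^{\top} Z \alpha \; + \; \beta \, P(d), \qquad \beta \ge 0 .
```

With $\beta = 0$ the inner minimization is free to put all of its mass on the hardest samples. A smooth convex $P$ that is smallest at the uniform distribution makes such skewed distributions costly, which is the sense in which outlier samples are prevented from dominating the decision boundary.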

In the second part of this work, we have developed a landmine detection system
by using the time-frequency analysis and AdaBoost approach. Through the
time-frequency analysis, we have obtained a good understanding of the scattering
phenomena for mines and clutter and found that the most discriminant information
between signal classes is time-frequency localized. Based on this observation, a
wavelet packet transform based detector has been proposed. Various feature selection
strategies as well as classification schemes have been investigated. The proposed
detector has been tested in the Army blind tests, showing a very promising detection
performance. In order to further improve the classification performance and deal
with the problem of unlimited possibilities of clutter, the AdaBoost algorithm has
been investigated. We have shown experimentally that with the feature selection
being integrated into the AdaBoost iterations, AdaBoost can significantly improve
both the training and testing performances.

7.2  Future  Work 

The  future  work  can  be  divided  into  several  aspects: 

(1) AdaBoost can be roughly understood as a learn-from-errors algorithm, which
mimics the human learning process. However, it fails to address another aspect
of the human learning process, i.e., we usually give up some samples if they are
too hard to learn. Through the control of the distribution skewness, we integrate
this missing part into the AdaBoost algorithm. It is a brand-new concept
of regularization. We refer to it as adaptive regularization. The regularized
boosting algorithms proposed based on this idea perform very well in the
experiments. In-depth theoretical studies along this direction are warranted and
should be the subject of future research.

(2)  The  idea  of  adaptive  regularization  can  also  be  applied  to  the  regression  scenario 
and  further  investigations  are  warranted. 

(3)  Further  developments  of  regularized  multiclass  boosting  algorithms  are  desired. 
Their  applications  to  real-world  datasets  as  well  as  empirical  comparisons  with 
other  regularized  multiclass  boosting  algorithms  are  needed. 

(4) Finally, due to the nature of the landmine detection problem, which is similar
to the still-open face detection problem, the key to the success of a landmine
detection system is to exploit domain knowledge to the maximum extent while
avoiding overfitting. We have developed a set of machine learning tools. Careful
adaptations and applications of these tools to landmine detection are required.
The effective and efficient combination of feature selection and machine learning
also deserves much attention.


APPENDIX  A 


Notations 

$\|\cdot\|_p$                    p-norm
$\rho$                           margin of classifier
$\rho(\mathbf{x})$               margin of sample $\mathbf{x}$
$\alpha$                         combination coefficients
$\hat{\alpha}$                   unnormalized combination coefficients: $\alpha = \hat{\alpha}/\|\hat{\alpha}\|_1$
$\mathbf{x}$                     data sample
$y$                              class label
$\mathcal{X}$                    sample space
$\mathcal{Y}$                    label space
$\mathcal{H}$, $|\mathcal{H}|$   hypothesis function set, cardinality of set $\mathcal{H}$
$h$                              hypothesis function, base learner
$n$, $N$                         data index, number of patterns
$c$, $C$                         class index, number of classes
$t$, $T$                         AdaBoost iteration, maximum iteration
$\mathcal{P}^N$                  distribution simplex
$M$                              code matrix
$M(c)$                           row of code matrix
$Z$                              gain matrix
$z_{\cdot t}$                    column of gain matrix
$z_{n \cdot}$                    row of gain matrix
$\Pr(A)$                         probability of event $A$
$d$, $D$                         query distribution
$F(\mathbf{x})$                  ensemble classifier
$f(\mathbf{x})$                  normalized ensemble classifier
$I(\cdot)$                       indicator function




REFERENCES 


[1]  Delve  databases,  http://www.cs.utoronto.ca/  delve/,  1998. 

[2]  Adopt-a-minefield,  http:/ /www. landmine.org,  2000. 

[3]  E.  L.  Allwein,  R.  E.  Schapire,  and  Y.  Singer.  Reducing  multiclass  to  binary:  A 
unifying  approach  for  margin  classifiers.  Journal  of  Machine  Learning  Research, 
1:113-141,  December  2000. 

[4]  E.  Baucer  and  R.  Kohavi.  An  empirical  comparison  of  voting  classification 
algorithms:  bagging,  boosting,  and  variants.  Machine  Learning,  36:105-139, 
1999. 

[5]  C.  Bishop.  Neural  Networks  for  Pattern  Recognition.  New  York:  Oxford,  1995. 

[6]  C.  Blake  and  C.  Merz.  UCI  repository  of  machine  learning  databases, 
http://www.ics.uci.edu/  mlearn/mlrepository.html,  1998. 

[7]  B.  Boashash.  Note  on  the  use  of  the  Wigner  distribution  for  time-frequency 
signal  analysis.  IEEE  Transactions  on  Acoustics,  Speech  and  Signal  Processing, 
36(9):1518-1521,  September  1988. 

[8]  L.  Breiman.  Bagging  predictors.  Machine  Learning,  24(2):123-140,  1996. 

[9]  L.  Breiman.  Arcing  classifiers.  The  Annals  of  Statistics,  26(3):801-849,  1998. 

[10]  L.  Breiman.  Prediction  games  and  arcing  algorithms.  Neural  Computation, 
11(7):1493-1517,  October  1999. 

[11]  L.  Carin,  N.  Geng,  and  M.  McClure.  Ultra- wide-band  Synthetic- Aperture  radar 
for  mine-field  detection.  IEEE  Antennas  and  Propagation  Magazine,  41(1):18- 
33,  February  1999. 

[12]  V.  Chen  and  H.  Ling.  Joint  time-frequency  analysis  for  radar  signal  and  image 
processing.  IEEE  Signal  Processing  Magazine,  16(2):81-93,  March  1999. 

[13]  V.  Chen  and  H.  Ling.  Time- Frequency  Transforms  for  Radar  Imaging  and  Signal 
Analysis.  Artech  House,  Boston,  London,  2002. 

[14]  R.  Coifman  and  M.  Wickerhauser.  Entropy-based  algorithms  for  best  basis  se- 
lection. IEEE  Transactions  on  Information  Theory,  38(2):713-718,  March  1992. 

[15]  C.  Cortes  and  V.  Vapnik.  Support  vector  networks.  Machine  Learning, 
20(3):273-297,  September  1995. 

[16]  R.  Cosgrove,  P.  Milanfar,  and  J.  Kositsky.  Trained  detection  ofburied  mines  in 
sar  images  via  the  defiection  optimal  criterion.  IEEE  Transactions  on  Geoscience 
and  Remote  Sensing,  submitted  2003. 


118 


119 


[17]  T.  M.  Cover  and  J.  A.  Thomas.  Elements  of  Information  Theory.  Wiley  Inter- 
science Press,  New  York,  1991. 

[18]  K.  Crammer  and  Y.  Singer.  On  the  learnability  and  design  of  output  codes  for 
multiclass  problems.  In  Computational  Tearing  Theory,  pages  35-46,  2000. 

[19]  A.  Demiriz,  K.  P.  Bennett,  and  J.  Shawe-Taylor.  Linear  programming  boosting 
via  column  generation.  Machine  Learning,  46(1-3) :225-254,  2002. 

[20]  T.  G.  Dietterich.  Machine  learning  research:  Four  current  direction.  AI  Maga- 
zine, 18(4):97-136,  1997. 

[21]  T.  G.  Dietterich.  An  experimental  comparison  of  three  methods  for  constructing 
ensembles  of  decision  trees:  Bagging,  boosting,  and  randomization.  Machine 
Learning,  40:139-157,  2000. 

[22]  T.  G.  Dietterich  and  G.  Bakiri.  Solving  multiclass  learning  problems  via  error- 
correcting  output  codes.  Journal  of  Artificial  Intelligence  Research,  pages  263- 
286,  1995. 

[23]  R.  Duda,  P.  Hart,  and  D.  Stork.  Pattern  Classification.  J.  Wiley,  New  York, 
2000. 

[24]  I.  Ekeland  and  R.  Temam.  Convex  Analysis  and  Variational  Problems.  North- 
Holland  Pub.  Go.,  Amsterdam,  1976. 

[25]  K.  Etemad  and  R.  Chellappa.  Separability-based  multiscale  basis  selection  and 
feature  extraction  for  signal  and  image  classification.  IEEE  Transactions  on 
Image  Processing,  7(10),  October  1998. 

[26]  Y.  Freund.  An  adaptive  version  of  the  boost  by  majority  algorithm.  Machine 
Learning,  pages  293-318,  June  2001. 

[27]  Y.  Freund  and  R.  E.  Schapire.  A decision-theoretic  generalization  of  on-line 
learning  and  an  application  to  boosting.  In  European  Conference  on  Computa- 
tional Learning  Theory,  pages  23-37,  1995. 

[28]  Y.  Freund  and  R.  E.  Schapire.  Game  theory,  on-line  prediction  and  boosting. 
Proceedings  of  the  ninth  annual  conference  on  Computational  learning  theory, 

[29]  J.  Friedman,  T.  Hastie,  and  R.  Tibshirani.  Additive  logistic  regression:  a sta- 
tistical view  of  boosting,  1998. 

[30]  K.  Fukunaga.  Statistical  Pattern  Recognition,  2nd  ed.  Academic,  New  York, 
1990. 

[31]  A.  J.  Grove  and  D.  Schuurmans.  Boosting  in  the  limit:  Maximizing  the  margin 
of  learned  ensembles.  In  AAAI/IAAI,  pages  692-699,  1998. 

[32]  K.  Gu,  J.  Li,  M.  Bradley,  J.  Habersat,  and  G.  Maksymonko.  Adaptive  ground 
bounce  removal.  Proceedings  of  the  2002  SPIE  International  Symposium  on 
Optical  Engineering  in  Aerospace  Sensing,  4742:719-727,  April  2002. 

[33]  V . Guruswami  and  A.  Sahai.  Multiclass  learning,  boosing,  and  error-correcting 
codes.  In  Proc.  of  the  Twelfth  Annual  Conference  on  Computational  Learning 
Theory,  pages  145-155.  AGM  Press,  New  York,  USA,  1999. 


120 


[34]  T.  Hastie  and  R.  Tibshirani.  Classification  by  pairwise  coupling.  In  M.  I.  Jor- 
dan, M.  J.  Kearns,  and  S.  A.  Sofia,  editors.  Advances  in  Neural  Information 
Processing  Systems,  volume  10.  The  MIT  Press,  1998. 

[35]  A.  K.  Hocaoglu,  P.  Gader,  J.  M.  Keller,  and  B.  Nelson.  Anti-personnel  land 
mine  detection  and  discrimination  using  acoustic  data.  Journal  of  Subsurface 
Sensing  Technologies  and  Application,  3(2):75-93,  April  2002. 

[36]  A.  Jain  and  D.  Zongker.  Feature  selection:  evaluation,  application,  and  small 
sample  performance.  IEEE  Transactions  on  Pattern  Analysis  and  Machine  In- 
telligence, 19(2),  February  1997. 

[37]  W.  Jiang.  Does  boosting  overfit:  Views  from  an  exact  solution.  Technical  Report 
00-03,  Department  of  Statistics,  Northwestern  University,  2000. 

[38]  W.  Jiang.  Some  theoretical  aspects  of  boosting  in  the  presence  of  noisy  data.  In 
Proc.  18th  International  Conf.  on  Machine  Learning,,  pages  234-241.  Morgan 
Kaufmann,  San  Francisco,  CA,  2001. 

[39]  W.  Jiang.  Is  regularization  unnecessary  for  boosting.  Proc.  18th  International 
Workshop  on  Artificial  Intelligence  and  Statistics,  to  appear. 

[40]  J.  Kivinen  and  M.  K.  Warmuth.  Boosting  as  entropy  projection.  In  Computa- 
tional hearing  Theory,  pages  134-144,  1999. 

[41]  J.  Kositsky  and  C.  Amazeen.  Result  from  a forward-looking  GPR  mine  detection 
system.  Proceedings  of  SPIE  on  Detection  and  Remediation  Technologies  for 
Mine  and  Minelike  Targets  VI,  4394:700-711,  April  2001. 

[42]  H.  Ling,  J.  Moore,  D.  Bouche,  and  V.  Saavedra.  Time-frequency  analysis  of 
backscattered  data  from  a coated  strip  with  a gap.  IEEE  Transactions  on  An- 
tennas and  Propagation,  41(8):1147-1150,  August  1993. 

[43]  G.  Liu,  Y.  Jiang,  and  J.  Li.  Radio  frequency  interference  suppression  for  land- 
mine detection  by  quadrupole  resonance.  IEEE  Transactions  on  Geoscience  and 
Remote  Sensing,  submitted  2003. 

[44]  S.  Mallat  and  Z.  Zhang.  Matching  pursuits  with  time-frequency  dictionaries. 
IEEE  Transactions  on  Signal  Processing,  41(12):3397-3415,  1993. 

[45]  R.  E.  Marsten,  W.  W.  Hogan,  and  J.  W.  Blankenship.  The  BOXSTEP  method 
for  large-scale  optimization.  Operations  Research,  23:389-405,  1975. 

[46]  A.  Martinez  and  A.  Kak.  PGA  versus  LDA.  IEEE  Transactions  on  Pattern 
Analysis  and  Machine  Intelligence,  23(2):228-233,  February  2001. 

[47]  L.  Mason,  J.  Bartlett,  P.  Baxter,  and  M.  Frean.  Functional  gradient  techniques 
for  combining  hypotheses.  In  B.  Scholkopf,  A.  Smola,  P.  Bartlett  and  D.  Schu- 
urmans,  editors.  Advances  in  Large  Margin  Classifiers,  2000. 

[48]  R.  Meir.  Support  vector  machines:  an  introduction.  Lecture  Notes,  June  2002. 

[49]  R.  Meir  and  G.  Ratsch.  An  introduction  to  boosting  and  leveraging.  In  S. 
Mendelson  and  A.  Smola,  editors.  Advanced  Lectures  on  Machine  Learning, 
LNCS,  Springer,  pages  119-184,  2003. 


121 


[50]  A.  Mohan,  C.  Papageorgiou,  and  T.  Poggio.  Example-based  object  detection  in 
images  by  components.  IEEE  Transactions  on  Pattern  Analysis  and  Machine 
Intelligence,  23(4):349-361,  2001. 

[51]  S.  G.  Nash  and  A.  Sofer.  Linear  and  Nonlinear  Programming.  McGraw-Hill, 
1996. 

[52]  P.  Neame.  Nonsmooth  dual  methods  in  integer  programming.  PhD  Thesis, 
University  of  Melbourne,  Australia,  1999. 

[53]  T.  Onoda,  G.  Ratsch,  and  K.  R.  Muller.  A non-intrusive  monitoring  system 
for  household  electric  appliances  with  inverters.  In  Proc.  of  NC’2000.  Academic 
Press  Canada/Switzerland,  2000. 

[54]  M.  Oren,  C.  Papageorgiou,  P.  Sinha,  E.  Osuna,  and  T.  Poggio.  Pedestrian 
detection  using  wavelet  templates.  Conference  on  Computer  Vision  and  Pattern 
Recognition,  Puerto  Rico,  June  17-19  1997. 

[55]  C.  Papageorgiou,  M.  Oren,  and  T.  Poggio.  A general  framework  for  object  de- 
tection. Proceedings  of  International  Conference  on  Computer  Vision,  Bombay, 
India,  January  1998. 

[56]  P.  Pudil  and  J.  Novovicova.  Novel  methods  for  subset  selection  with  respect  to 
problem  knowledge.  IEEE  Intelligent  Systems,  13(2),  Mar/Apr  1998. 

[57]  P.  Pudil,  J.  Novovicova,  and  J.  Kittler.  Floating  search  methods  in  feature 
selection.  Pattern  Recognigion  Letters,  pages  1119-1125,  November  1994. 

[58]  S.  Qian.  Introduction  to  Time- Frequency  and  Wavelet  Transforms.  Prentice  Hall 
PTR,  New  Jersey,  2002. 

[59]  S.  Qian  and  D.  Chen.  Joint  time-frequency  analysis.  IEEE  Signal  Processing 
Magazine,  16(2):52-67,  March  1999. 

[60]  J.  Quinlan.  Bagging,  boosting,  and  C4.5.  In  In  Proceedings  of  the  Thirteenth 
National  Conference  on  Artificial  Intelligence  and  the  Eighth  Innovative  Appli- 
cations of  Artificial  Intelligence  Conference,  pages  725-730.  AAAI  Press  / MIT 
Press,  1996. 

[61]  J.  R.  Quinlan.  Cf.S:  Programs  for  Machine  Learning.  Morgan  kaufmann,  1993. 

[62]  G.  Ratsch.  Robust  boosting  via  convex  optimization:  theory  and  application. 
PhD  Thesis,  University  of  Potsdam,  Germany,  October  2001. 

[63]  G.  Ratsch,  T.  Onoda,  and  K.-R.  Muller.  Soft  margins  for  AdaBoost.  Machine 
Learning,  42(3):287-320,  March  2001. 

[64]  G.  Ratsch,  B.  Scholkopf,  A.  Smola,  S.  Mika,  T.  Onoda,  and  K.-R.  Muller.  Robust 
ensemble  learning.  In  B.  Scholkopf  A.  Smola,  P.  Bartlett  and  D.  Schuurmans, 
editors.  Advances  in  Large  Margin  Classifiers,  pages  207-220,  2000. 

[65]  G.  Ratsch,  M.  Warmuth,  S.  Mika,  T.  Onoda,  S.  Lemm,  and  K.-R.  Muller.  Barrier 
boosting.  Proceedings  of  the  Thirteenth  Annual  Conference  on  Computational 
Learning  Theory,  2000. 


122 


[66]  N.  Saito  and  R.  Coifman.  Local  discriminant  bases.  In  A.  F.  Laine  and  M.  A. 
Unser,  editors,  Wavelet  Applications  in  Signal  and  Image  Processing  II,  Proc. 
SPIE  2303,  pages  2-14,  1994. 

[67]  R.  Schapire.  Using  output  codes  to  boost  multiclass  learning  problems.  In  Ma- 
chine Learning:  Proceedings  of  the  Fourteenth  International  Conference,  pages 
313-321,  1997. 

[68]  R.  Schapire.  The  boosting  approach  to  machine  learning:  An  overview.  In  MSRI 
Workshop  on  Nonlinear  Estimation  and  Classification,  March  2001. 

[69]  R.  Schapire  and  Y.  Singer.  Improved  boosting  algorithms  using  confidence-rated 
predictions.  Machine  Learning,  pages  297-336,  December  1999. 

[70]  R.  E.  Schapire,  Y.  Freund,  P.  Bartlett,  and  W.  S.  Lee.  Boosting  the  margin:  a 
new  explanation  for  the  effectiveness  of  voting  methods.  Proc.  Ifth  International 
Conference  on  Machine  Learning,  pages  322-330,  1997. 

[71]  R.  E.  Schapire  and  Y.  Singer.  Boostexter:  A boosting-based  system  for  text 
categorization.  Machine  Learning,  39(2/3):135-168,  2000. 

[72]  Y.  Sun  and  J.  Li.  Time-frequency  analysis  for  plastic  mine  detection  via  forward- 
looking  ground  penetrating  radar.  lEE  Proceedings- Radar,  Sonar  and  Naviga- 
tion, Special  Issue  on  Time- frequency  Analysis  for  Synthetic  Aperture  Radar 
and  Feature  Extraction,  150(4)  :253-261,  August  2003. 

[73]  Y.  Sun  and  J.  Li.  An  adaptive  learning  approach  to  landmine  detection.  IEEE 
Transactions  on  Aerospace  and  Electronic  Systems,  to  appear  2004. 

[74]  Y.  Sun  and  J.  Li.  A machine  learning  approach  to  landmine  detection  via 
forward-looking  Ground  Penetrating  Radar.  IEEE  Transactions  on  Geoscience 
and  Remote  Sensing,  submitted  2004. 

[75]  Y.  Sun  and  J.  Li.  Robust  linear  programming  boosting  algorithm.  IEEE  Trans- 
actions on  Signal  Processing,  submitted  2004. 

[76]  Y.  Sun,  J.  Li,  and  W.  Hager.  Regularized  AdaBoost  algorithms.  Journal  of 
Machine  Learning  Research,  submitted  2004. 

[77]  Y.  Sun,  J.  Li,  and  W.  Hager.  Two  new  regularized  AdaBoost  algorithms.  Inter- 
national Conference  on  Machine  Learning  and  Applications,  to  appear  2004. 

[78]  V.  Vapnik.  Statistical  Learning  Theory.  John  Wiley  and  Sons  Inc.,  New  York, 
1998. 

[79]  J.  von  Neumann.  Zur  theorie  der  gesellschaftsspiele.  Math.  Ann.,  pages  100:295- 
320,  1928. 

[80]  R.  Wu,  A.  Clement,  J.  Li,  and  E.  Larsson.  Adaptive  ground  bounce  removal. 
lEE  Electronics  Letters,  37(20):1250-1252,  September  2001. 


BIOGRAPHICAL  SKETCH 


Yijun Sun received B.S. degrees in both electrical and mechanical engineering
from Shanghai Jiao Tong University, Shanghai, China, in 1995 and the M.S. degree
in electrical engineering in 2003 from the University of Florida, Gainesville, where
he is currently pursuing the Ph.D. degree. From 1995 to 2000, he was an electrical
engineer with Siemens, Shanghai, China. Since 2000, he has been a research assistant
in the Department of Electrical and Computer Engineering at the University
of Florida, Gainesville. His current research interests include pattern recognition,
machine learning, time-frequency and wavelet analysis, as well as their applications
to landmine detection using ground penetrating radar.




I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and 
quality, as a dissertation for the degree of Doctor of Philosophy.


Jian  Li,  Chairman 


Professor  of  Electrical  and 
Computer  Engineering 


I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and 
quality,  as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Jianbo  Gao 

Assistant  Professor  of  Electrical  and 
Computer  Engineering 


I certify that I have read this study and that in my opinion it conforms to
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and 
quality,  as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 



Kenneth  C.  Slatton 
Assistant  Professor  of  Electrical  and 
Computer  Engineering 


I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and 
quality,  as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 


William  Hager 
Professor  of  Mathematics 


This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  College  of 
Engineering  and  to  the  Graduate  School  and  was  accepted  as  partial  fulfillment  of 
the  requirements  for  the  degree  of  Doctor  of  Philosophy. 

December  2004  

Pramod  P.  Khargonekar 
Dean,  College  of  Engineering 


Kenneth  J.  Gerhardt 
Interim  Dean,  Graduate  School 


